Advances in high-throughput technologies allow for measurements of many types of omics data, yet the meaningful integration of several different data types remains a significant challenge. Another important and difficult problem is the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival. Here we present a novel approach, called Perturbation clustering for data INtegration and disease Subtyping (PINS), which is able to address both challenges. The method is based on the observation that small changes in quantitative assays will be inherently present between individuals, even in a homogeneous population. Therefore, if distinct molecular subtypes do exist, they must be stable with respect to small changes in quantitative assays. In order to discover reliable subtypes from molecular data, we estimate how often each pair of patients is grouped together in the following scenarios: i) when the data are perturbed (by adding Gaussian noise), ii) when using different data types, and iii) when using different clustering techniques. We then partition patients into subgroups that are strongly connected in all scenarios.
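The pairwise grouping frequency under perturbation described above can be sketched in base R. This is a minimal illustration on toy data, assuming k-means as the clustering technique and a fixed noise level; the function names `connectivity` and `pair_frequency`, the noise standard deviation, and the number of perturbations are all hypothetical choices, not the PINS implementation.

```r
set.seed(1)

# Toy data (hypothetical): 20 "patients" x 5 features, two well-separated groups.
x <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),
           matrix(rnorm(50, mean = 6), nrow = 10))

# Connectivity matrix: entry (i, j) is 1 when patients i and j share a cluster.
connectivity <- function(labels) {
  outer(labels, labels, "==") * 1
}

# Cluster noisy copies of the data and average their connectivity matrices,
# which estimates how often each pair of patients is grouped together.
pair_frequency <- function(x, k, n_perturb = 50, noise_sd = 1) {
  freq <- matrix(0, nrow(x), nrow(x))
  for (i in seq_len(n_perturb)) {
    noisy <- x + matrix(rnorm(length(x), sd = noise_sd), nrow(x))
    labels <- kmeans(noisy, centers = k, nstart = 5)$cluster
    freq <- freq + connectivity(labels)
  }
  freq / n_perturb
}

freq <- pair_frequency(x, k = 2)
round(freq[1:3, c(1:3, 18:20)], 2)
```

Pairs with a frequency close to 1 are grouped together under almost every perturbation, which is the sense in which a subtype is "stable" here.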

Figure 1: Data integration and disease subtyping illustrated on the kidney renal clear cell carcinoma (KIRC) dataset. (a–b) The input consists of three matrices that have the same set of patients but different sets of measurements. (c) The optimal connectivity between the samples for each data type. Group 1 is further split into two subgroups in Stage II. (d) Kaplan-Meier survival curves of the 4 subtypes after the Stage II splitting of Group 1. The survival analysis indicates that the 4 groups discovered after Stage II have significantly different survival profiles (Cox p-value 6e-5).


PINSPlus optimizes two algorithms of PINS (Nguyen et al., 2017; Nguyen, 2017): (i) PerturbationClustering() to cluster a single data type, and (ii) SubtypingOmicsData() to integrate omics data. The algorithms calculate the difference between the original and the perturbed connectivity matrices and compute the empirical cumulative distribution function of the difference matrix (CDF-DM). The area under the CDF-DM curve for each number of clusters k, AUC_k, is used to assess the stability of the partitioning. However, the AUC values tend to converge after a certain number of iterations, so beyond that point additional iterations are unnecessary. PINSPlus exploits this convergence to determine an early stopping point for the perturbation clustering. In addition, PINSPlus uses multi-core processing to speed up the perturbation step, implemented in such a way that the result does not depend on the number of cores used.

Figure 2: AUC values after each iteration of perturbation processing for the KIRC dataset. Each line shows the AUC values for one k from 2 to 5. The ▲ symbols indicate the early stopping point for each k in PINSPlus.
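The AUC-based stability score and the idea of stopping once it has converged can be illustrated as follows. This is a sketch under stated assumptions, not the PINSPlus code: the stopping rule (`tol`, `patience`), the k-means clustering, and the function names are hypothetical. It uses the identity that, for values in [0, 1], the area under the empirical CDF equals one minus their mean.

```r
set.seed(2)

# Toy data (hypothetical): 20 samples x 4 features, two separated groups.
x <- rbind(matrix(rnorm(40, mean = 0), nrow = 10),
           matrix(rnorm(40, mean = 5), nrow = 10))

connectivity <- function(labels) outer(labels, labels, "==") * 1

# Area under the empirical CDF of the difference-matrix entries.
# For entries in [0, 1], integral_0^1 F(c) dc = 1 - mean(entries).
auc_cdf_dm <- function(diff_matrix) 1 - mean(diff_matrix)

# Perturb repeatedly and stop early once the AUC has changed by less than
# `tol` for `patience` consecutive iterations (an assumed stopping rule).
perturb_until_stable <- function(x, k, max_iter = 200, tol = 1e-3, patience = 10) {
  original <- connectivity(kmeans(x, centers = k, nstart = 5)$cluster)
  running <- matrix(0, nrow(x), nrow(x))
  trace <- numeric(0)
  calm <- 0
  for (i in seq_len(max_iter)) {
    noisy <- x + matrix(rnorm(length(x)), nrow(x))
    running <- running + connectivity(kmeans(noisy, centers = k, nstart = 5)$cluster)
    auc <- auc_cdf_dm(abs(original - running / i))
    calm <- if (i > 1 && abs(auc - trace[i - 1]) < tol) calm + 1 else 0
    trace[i] <- auc
    if (calm >= patience) break
  }
  list(auc = trace[length(trace)], iterations = length(trace), trace = trace)
}

res <- perturb_until_stable(x, k = 2)

# Score each candidate k from 2 to 5 and keep the most stable one.
aucs <- sapply(2:5, function(k) perturb_until_stable(x, k)$auc)
best_k <- (2:5)[which.max(aucs)]
```

An AUC of 1 means the perturbed connectivity matches the original exactly; lower values indicate that the partition changes under noise.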


SMRT improves two algorithms of PINS (Nguyen et al., 2017; Nguyen, 2017; Nguyen et al., 2019): (i) PerturbationClustering() to cluster a single data type, and (ii) SubtypingOmicsData() to integrate omics data. When only a single data type is available, SMRT uses the SMRT.Single() function to perform subtyping. The method is fast and highly scalable, and can subtype hundreds of thousands of samples in under three minutes. We implement an ensemble strategy to optimize the running time while maintaining patient partitioning performance. We use singular value decomposition (SVD) and randomized singular value decomposition (RSVD) to project the original data to a lower-dimensional space. Then, we repeatedly perturb the subspace data by adding Gaussian noise and cluster the patients using different cluster numbers. The clustering assignment that gives the best agreement between the perturbed and original data yields the optimal subtypes. When the dataset contains a large number of patients, we perform subtyping on a subset of size 2,000 and map the remaining, unpartitioned patients to the closest subtype using the k-nearest neighbors (KNN) algorithm. We use the SMRT.Multi() function when multi-omics data types are available. The method performs multiple stages of perturbation clustering and outputs a patient connectivity graph for each data type. The graphs that are resilient across all data types yield the most agreed-upon number of patient subtypes. We extend the functionality of SMRT by building a web application that allows users to perform and visualize patient subtyping via an online portal. The web portal is particularly useful for users who have limited computational resources.

Figure 3: The benchmarking results of SMRT. (A) Benchmarking results of SMRT compared with SNF, CIMLR, NEMO, moCluster, iClusterBayes, LRACluster, MCCA, and IntNMF using 39 TCGA and METABRIC datasets. SMRT outperforms the other methods by having the highest average -log10(p-value). (B) Comparison of the running time of SNF, CIMLR, NEMO, moCluster, and SMRT on a simulated dataset. Here, we simulate a dataset with three known subtypes and gradually increase the number of patients and features. SMRT is the fastest and most scalable method.
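The projection and subset-then-map steps described for SMRT can be sketched in base R. This is only an illustration under assumptions: it uses a plain SVD (not RSVD), k-means with a fixed number of clusters instead of perturbation clustering, a subset of 100 rather than 2,000 samples, and hypothetical helper names `project_svd` and `knn_assign`.

```r
set.seed(3)

# Simulated data (hypothetical): 300 patients x 50 features, three subtypes.
n <- 300
p <- 50
shift <- rep(c(0, 4, 8), each = n / 3)
x <- matrix(rnorm(n * p), n, p) + shift

# Project centered data onto its leading singular directions.
project_svd <- function(x, n_dim = 10) {
  xc <- scale(x, center = TRUE, scale = FALSE)
  s <- svd(xc, nu = n_dim, nv = 0)
  s$u %*% diag(s$d[seq_len(n_dim)], n_dim)
}

# Assign each query row the majority label among its k nearest training rows.
knn_assign <- function(train, train_labels, query, k = 5) {
  apply(query, 1, function(q) {
    d <- colSums((t(train) - q)^2)
    nearest <- order(d)[seq_len(k)]
    as.integer(names(which.max(table(train_labels[nearest]))))
  })
}

z <- project_svd(x)

# Cluster a random subset only, then map every remaining patient by KNN.
subset_idx <- sample(n, 100)
fit <- kmeans(z[subset_idx, ], centers = 3, nstart = 10)

labels <- integer(n)
labels[subset_idx] <- fit$cluster
labels[-subset_idx] <- knn_assign(z[subset_idx, ], fit$cluster,
                                  z[-subset_idx, , drop = FALSE])
table(labels)
```

Clustering only the subset keeps the expensive step at a fixed size, while the KNN mapping grows linearly with the number of remaining patients.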

About this Application


This application exists to demonstrate SMRT and make its underlying implementation more accessible. Users can select one of the built-in datasets or upload their own. A Principal Component Analysis (PCA) plot will be generated for the chosen data. Users may then select 'Start Analysis' to begin clustering using SMRT. The results of the clustering are displayed in the plot, and users can download them in CSV and image form.


Nguyen, H., Shrestha, S., Draghici, S., & Nguyen, T. (2018). PINSPlus: a tool for tumor subtype discovery in integrated genomic data. Bioinformatics.

Nguyen, T., Tagett, R., Diaz, D., & Draghici, S. (2017). A novel approach for data integration and disease subtyping. Genome research, 27(12), 2025-2039.


The Cancer Genome Atlas (TCGA) datasets and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets are saved in ".RData" format, which can store multiple data types. We recommend that users perform multi-omics subtyping on the ".RData" files using the PINSPlus R package. To do so in the web application instead, convert the ".RData" files to ".csv" or ".rds" format using the provided R scripts and upload all data types at the same time for analysis.

generate_csv.R generate_rds.R
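For orientation, a conversion of this kind could look like the sketch below. It is not the content of the provided scripts: it assumes the ".RData" file holds a single named list of matrices, and the object name `toy`, the data type names, and all file names are hypothetical stand-ins created in a temporary directory.

```r
# Create a small stand-in .RData file (hypothetical object and data types).
work_dir <- tempfile("smrt_demo_")
dir.create(work_dir)
patients <- paste0("P", 1:4)
toy <- list(
  mRNA  = matrix(rnorm(12), nrow = 4, dimnames = list(patients, paste0("g", 1:3))),
  miRNA = matrix(rnorm(8),  nrow = 4, dimnames = list(patients, paste0("m", 1:2)))
)
rdata_path <- file.path(work_dir, "toy.RData")
save(toy, file = rdata_path)

# Load into a private environment so nothing in the workspace is overwritten.
env <- new.env()
obj_names <- load(rdata_path, envir = env)
data_list <- get(obj_names[1], envir = env)

# Option 1: one CSV file per data type, samples in rows.
for (type in names(data_list)) {
  write.csv(data_list[[type]], file.path(work_dir, paste0(type, ".csv")))
}

# Option 2: a single .rds file holding the whole list of matrices.
rds_path <- file.path(work_dir, "toy.rds")
saveRDS(data_list, rds_path)

list.files(work_dir)
```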

The Cancer Genome Atlas data

The processed datasets are from The Cancer Genome Atlas (TCGA) website and the Firebrowse website. The datasets include Kidney renal clear cell carcinoma (KIRC), Glioblastoma multiforme (GBM), Acute Myeloid Leukemia (LAML), Lung squamous cell carcinoma (LUSC), Bladder Urothelial Carcinoma (BLCA), Head and Neck squamous cell carcinoma (HNSC), Liver hepatocellular carcinoma (LIHC), Stomach adenocarcinoma (STAD), Thymoma (THYM), Glioma (GBMLGG), Brain Lower Grade Glioma (LGG), Pancreatic adenocarcinoma (PAAD), Skin Cutaneous Melanoma (SKCM), Colorectal adenocarcinoma (COADREAD), Uterine Corpus Endometrial Carcinoma (UCEC), Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Colon adenocarcinoma (COAD), Breast invasive carcinoma (BRCA), Stomach and Esophageal carcinoma (STES), Kidney renal papillary cell carcinoma (KIRP), Kidney Chromophobe (KICH), Uveal Melanoma (UVM), Adrenocortical carcinoma (ACC), Sarcoma (SARC), Mesothelioma (MESO), Rectum adenocarcinoma (READ), Uterine Carcinosarcoma (UCS), Ovarian serous cystadenocarcinoma (OV), Esophageal carcinoma (ESCA), Paraganglioma (PCPG), Lung adenocarcinoma (LUAD), Prostate adenocarcinoma (PRAD), Thyroid carcinoma (THCA) and Testicular Germ Cell Tumors (TGCT).

TCGA datasets in RData format

TCGA datasets in rds format

The Molecular Taxonomy of Breast Cancer International Consortium datasets

The processed datasets are from the European Genome-Phenome Archive and cBioPortal.

METABRIC datasets in RData format

METABRIC datasets in rds format

Analysis with SMRT

To run SMRT on the provided datasets, we provide the package and scripts, which can be downloaded here:

Please follow instructions in
to run the provided scripts.

How to Use

Welcome to the SMRT web application. This tool is designed for researchers and other interested parties to try out the SMRT methodology without installing the R package. For those interested in the R implementation of SMRT, please see 'R Package Instructions' below. For all others, see 'Application Instructions' for a walkthrough on how to use this app.

Application Instructions

Step 1 - Select Data

To begin, select "Visualization" in the main menu on the left side. After that, select the dataset you would like to cluster using SMRT in the "Data Upload and Analysis" panel on the left side. There are three options available in the left-hand menu:

(1) Choose one of the built-in datasets from the 'Dataset' dropdown.

If you do not have your own data to use with the website, you can select one of two built-in datasets to perform the analysis. The first dataset is AML2004, which contains 38 samples: 11 acute myeloid leukemia, 19 acute lymphoblastic leukemia B cell, and 8 T cell. The second dataset is KIRC, which contains 123 gene expression samples of kidney renal clear cell carcinoma.

(2) Choose a local CSV file to upload.

If uploading data, the format should follow that of the example data. That is, samples should make up the rows, and features should make up the columns. Once the data is selected, you will see a progress bar indicating when the file upload is complete. Note the options available in the left-hand menu for defining the CSV header and separator if necessary. If the upload is successful, the data will be shown in tabular form at the top of the app, above a PCA plot of the same data.

To perform clustering using multiple data types, select all desired files at the same time in your file browser when choosing files to upload. Each file should store exactly one data type. Once the files are chosen, the application uploads and processes all of the selected datasets together.

To add clinical information to the analysis, upload the clinical data together with the other data types. Note that the file name must be exactly 'clinical.csv', and that clinical information is used only for visualization.
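Before uploading, a set of CSV files can be checked locally against the layout described above. The sketch below is an assumption-laden illustration: it supposes that the first column of each file holds the sample IDs, and the file names other than 'clinical.csv', the toy values, and the helper `check_upload` are hypothetical.

```r
# Write a toy upload set: one CSV per data type plus 'clinical.csv'.
upload_dir <- tempfile("upload_")
dir.create(upload_dir)
samples <- paste0("P", 1:5)
expr <- matrix(rnorm(15), nrow = 5, dimnames = list(samples, paste0("gene", 1:3)))
meth <- matrix(runif(10), nrow = 5, dimnames = list(samples, paste0("cpg", 1:2)))
clinical <- data.frame(age = c(61, 47, 70, 55, 66), row.names = samples)
write.csv(expr, file.path(upload_dir, "expression.csv"))
write.csv(meth, file.path(upload_dir, "methylation.csv"))
write.csv(clinical, file.path(upload_dir, "clinical.csv"))

# Read every CSV back (first column = sample IDs) and summarise the layout.
check_upload <- function(dir) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(paths, read.csv, row.names = 1, check.names = FALSE)
  names(tables) <- basename(paths)
  list(
    has_clinical   = "clinical.csv" %in% names(tables),
    n_samples      = vapply(tables, nrow, integer(1)),
    n_features     = vapply(tables, ncol, integer(1)),
    shared_samples = Reduce(intersect, lapply(tables, rownames))
  )
}

report <- check_upload(upload_dir)
str(report)
```

If `shared_samples` is shorter than the number of rows in any file, the files do not describe the same set of patients.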

(3) Choose a local rds file to upload.

If you are more familiar with the .rds data format, you can upload your own data in the following format. The .rds file should store a list of data matrices in which each matrix represents one data type. In each matrix, samples should make up the rows, and features should make up the columns. Once the data is selected, you will see a progress bar indicating when the file upload is complete.

When you have the data uploaded, the web application will automatically render the data viewer table as well as a 2-D visualization for each data type separately in the middle panel.

To add clinical information to the analysis, include the clinical data in the list of data matrices alongside the other data types. Note that the name of the clinical matrix must be exactly 'clinical', and that clinical information is used only for visualization.
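A list of this shape can be assembled and sanity-checked in R before saving it. This is a minimal sketch: the data type names 'expression' and 'methylation', the toy values, and the helper `validate_rds_list` are hypothetical, and only the element name 'clinical' comes from the instructions above.

```r
samples <- paste0("P", 1:6)

# A named list with one matrix per data type, samples in rows.
data_list <- list(
  expression  = matrix(rnorm(24), nrow = 6,
                       dimnames = list(samples, paste0("gene", 1:4))),
  methylation = matrix(runif(18), nrow = 6,
                       dimnames = list(samples, paste0("cpg", 1:3))),
  clinical    = matrix(c("I", "II", "I", "III", "II", "I"), ncol = 1,
                       dimnames = list(samples, "stage"))
)

# Check that the list is named, that a 'clinical' element is present,
# and that every element has the same sample IDs in the same order.
validate_rds_list <- function(lst) {
  stopifnot(is.list(lst), !is.null(names(lst)))
  same_rows <- vapply(
    lst,
    function(m) identical(rownames(m), rownames(lst[[1]])),
    logical(1)
  )
  list(
    data_types   = setdiff(names(lst), "clinical"),
    has_clinical = "clinical" %in% names(lst),
    same_samples = all(same_rows)
  )
}

rds_path <- tempfile(fileext = ".rds")
saveRDS(data_list, rds_path)
check <- validate_rds_list(readRDS(rds_path))
```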

Step 2 - Start Analysis

Use the 'Data Info' box on the upper right-hand side to confirm that the desired data has been properly loaded. Then, start SMRT clustering using the 'Start Analysis' button. Once processed, the application will return the following:

(1) - The optimal number of clusters as determined by SMRT (displayed in the 'Data Info' box)

(2) - An updated PCA plot that includes the color-coded cluster assignments

(3) - Three buttons beneath the plot for various export functionality (see next section)

Step 3 - Explore Results

The PCA plot may be explored interactively as the user desires. Note the included legend denoting the cluster-color relationships.

The clustering results may also be downloaded in tabular form. Click the 'PCA' data button to download the original, unclustered PCA data. Use the 'Clustering Result' button to download a list of the cluster assignments for each sample. Lastly, an image of the plot may be downloaded by clicking the 'PCA Plot' button (see the next section for customization details).

PCA Plot Settings

The 'PCA Plot Settings' panel provides some additional options for customizing the PNG plot output. Height, width, and DPI of the plot may be adjusted here.
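Outside the application, comparable exports can be produced with base R. The sketch below is an illustration only: it uses k-means labels as a stand-in for the SMRT result, and the file names, plot size, and DPI are hypothetical values rather than the application's defaults.

```r
set.seed(4)

# Toy data (hypothetical): 30 samples x 4 features in two groups.
x <- rbind(matrix(rnorm(60, mean = 0), nrow = 15),
           matrix(rnorm(60, mean = 5), nrow = 15))
rownames(x) <- paste0("S", seq_len(nrow(x)))

# Stand-in cluster labels; in practice these would come from SMRT.
clusters <- kmeans(x, centers = 2, nstart = 5)$cluster

# PCA coordinates and cluster assignments as two separate tables.
pca <- prcomp(x, center = TRUE, scale. = FALSE)
pca_table <- data.frame(sample = rownames(x), PC1 = pca$x[, 1], PC2 = pca$x[, 2])
cluster_table <- data.frame(sample = rownames(x), cluster = clusters)

out_dir <- tempdir()
pca_csv <- file.path(out_dir, "pca_data.csv")
cluster_csv <- file.path(out_dir, "clustering_result.csv")
write.csv(pca_table, pca_csv, row.names = FALSE)
write.csv(cluster_table, cluster_csv, row.names = FALSE)

# PNG export with explicit width, height (inches) and resolution (DPI).
if (capabilities("png")) {
  png(file.path(out_dir, "pca_plot.png"), width = 6, height = 4,
      units = "in", res = 150)
  plot(pca_table$PC1, pca_table$PC2, col = clusters, pch = 19,
       xlab = "PC1", ylab = "PC2")
  legend("topright", legend = paste("Cluster", sort(unique(clusters))),
         col = sort(unique(clusters)), pch = 19)
  dev.off()
}
```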

Connectivity Matrix

This tab provides heat maps of the connectivity matrices obtained by SMRT for each candidate number of subtypes, ranging from 2 to 5.

R Package Instructions

Install PINSPlus

SMRT is integrated into the PINSPlus R package. The latest version of the PINSPlus package can be installed from the CRAN repository using the commands below:

> install.packages("PINSPlus") 
> library(PINSPlus)
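Once the package is installed, a first run on a single data type might look like the sketch below. It assumes that the installed version exports PerturbationClustering() with `data` and `verbose` arguments, returns a list with `k` and `cluster` elements, and bundles the AML2004 dataset with a `Gene` matrix; please confirm these details against the package documentation. The code is guarded so that it does nothing when PINSPlus is not available.

```r
# Run only when PINSPlus is installed; otherwise skip quietly.
has_pins <- requireNamespace("PINSPlus", quietly = TRUE)

if (has_pins) {
  library(PINSPlus)

  # Built-in example data (assumed to ship with the package).
  data(AML2004)
  expr <- as.matrix(AML2004$Gene)   # samples in rows, genes in columns

  # Single data type: perturbation clustering chooses the number of subtypes.
  result <- PerturbationClustering(data = expr, verbose = FALSE)

  print(result$k)                   # estimated number of subtypes
  print(table(result$cluster))      # number of samples per subtype
}
```

For several data types, the package's SubtypingOmicsData() function plays the corresponding role and takes a list of matrices that share the same samples.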

Package Documentation

For detailed documentation on the PINSPlus R Package, including detailed usage instructions, click the link below:


If you find our software useful to your work, please cite our software using the citation below:

Tran, B., Tran, D., Nguyen, H., Cassell, A., Dascalu, S., Draghici, S. & Nguyen, T. (2020). Randomized data transformation for cancer subtyping and big data analysis.

Nguyen, H., Shrestha, S., Draghici, S., & Nguyen, T. (2018). PINSPlus: a tool for tumor subtype discovery in integrated genomic data. Bioinformatics.

Nguyen, T., Tagett, R., Diaz, D., & Draghici, S. (2017). A novel approach for data integration and disease subtyping. Genome research, 27(12), 2025-2039.