Model-based clustering of array CGH data.

Sohrab P Shah, K-John Cheung, Nathalie A Johnson, Guillaume Alain, Randy D Gascoyne, Douglas E Horsman, Raymond T Ng, Kevin P Murphy, Bioinformatics (Oxford, England) 25, i30-8 (2009) 2009


Analysis of array comparative genomic hybridization (aCGH) data for recurrent DNA copy number alterations from a cohort of patients can yield distinct sets of molecular signatures or profiles. This can be due to the presence of heterogeneous cancer subtypes within a supposedly homogeneous population.

We propose a novel statistical method for automatically detecting such subtypes or clusters. Our approach is model based: each cluster is defined in terms of a sparse profile, which contains the locations of unusually frequent alterations. The profile is represented as a hidden Markov model. Samples are assigned to clusters based on their similarity to the cluster’s profile. We simultaneously infer the cluster assignments and the cluster profiles using an expectation maximization-like algorithm. We show, using a realistic simulation study, that our method is significantly more accurate than standard clustering techniques. We then apply our method to two clinical datasets. In particular, we examine previously reported aCGH data from a cohort of 106 follicular lymphoma patients, and discover clusters that are known to correspond to clinically relevant subgroups. In addition, we examine a cohort of 92 diffuse large B-cell lymphoma patients, and discover previously unreported clusters of biological interest which have inspired followup clinical research on an independent cohort.

Software and synthetic datasets are available at approximately sshah/acgh as part of the CNA-HMMer package.

Supplementary data are available at Bioinformatics online.