ddClone: Joint statistical inference of clonal populations from single cell and bulk tumour sequencing data

ddClone is a statistical framework that leverages data from both single cell and bulk sequencing. The approach (Salehi et al.) is predicated on the notion that single cell sequencing data can inform and improve the clustering of allele fractions derived from bulk sequencing data in a joint statistical model.
ddClone combines a Bayesian non-parametric prior informed by single cell data with a likelihood model based on bulk sequencing data to infer clonal population architecture. Intuitively, the prior encourages genomic loci with co-occurring mutations in single cells to cluster together. Using a cell-locus binary matrix from single cell sequencing, ddClone computes a distance matrix between mutations using the Jaccard distance with exponential decay. This matrix then serves as a prior for inference over mutation clusters and their prevalences from deeply sequenced bulk data in a distance-dependent Chinese restaurant process (ddCRP) framework (Frazier and Blei 2012). The output of the model is the most probable set of mutational clusters and the prevalence of each mutation in the population. The code is based on the ddCRP model as introduced and implemented by Frazier and Blei (2012).
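To make the prior construction concrete, here is a minimal base-R sketch of the steps described above: Jaccard distances between mutations, an exponential decay, and the resulting ddCRP link probabilities. The toy matrix, the decay form `exp(-d / a)`, the rate `a`, and the self-link weight `alpha` are assumptions chosen for illustration; this is not ddClone's own code.

```r
# Toy cell-by-locus binary matrix: rows are single cells, columns are mutations.
# (Hypothetical data, for illustration only.)
cellLocus <- matrix(c(1, 1, 0, 0,
                      1, 1, 0, 0,
                      1, 1, 1, 0,
                      0, 0, 1, 1,
                      0, 0, 0, 1),
                    nrow = 5, byrow = TRUE,
                    dimnames = list(paste0("cell", 1:5), paste0("mut", 1:4)))

# Jaccard distance between two binary vectors: 1 - |A and B| / |A or B|.
jaccardDistance <- function(x, y) {
  unionSize <- sum(x | y)
  if (unionSize == 0) return(0)   # neither mutation observed in any cell
  1 - sum(x & y) / unionSize
}

nMut <- ncol(cellLocus)
distMat <- matrix(0, nMut, nMut,
                  dimnames = list(colnames(cellLocus), colnames(cellLocus)))
for (i in seq_len(nMut)) {
  for (j in seq_len(nMut)) {
    distMat[i, j] <- jaccardDistance(cellLocus[, i], cellLocus[, j])
  }
}

# Exponential decay turns distances into affinities in (0, 1].
# The decay rate `a` is a hypothetical tuning parameter.
a <- 0.5
affinity <- exp(-distMat / a)

# ddCRP prior: mutation i links to mutation j with probability proportional to
# affinity[i, j], or to itself with probability proportional to alpha.
alpha <- 1
linkProb <- affinity
diag(linkProb) <- alpha
linkProb <- linkProb / rowSums(linkProb)

round(distMat, 2)
round(linkProb, 2)
```

Mutations that always co-occur in the same cells (here `mut1` and `mut2`) have distance 0 and therefore the highest probability of linking to each other, which is the sense in which the prior encourages them to cluster together.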

Install the package

An easy way to install ddclone is as follows:


A simple example

1. Simulated Data

Load the library:

library(ddclone)


Run ddClone over simulated data:

ddCloneRes <- ddclone(dataObj =,
              outputPath = './output/dollo.0/', tumourContent = 1.0,
              numOfIterations = 100, thinning = 1, burnIn = 1,
              seed = 1)
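
The `numOfIterations`, `burnIn`, and `thinning` arguments follow the usual MCMC vocabulary. How ddclone applies them internally is not shown here; as a generic sketch that assumes `burnIn` counts discarded iterations and `thinning` is the spacing between kept draws, the retained iterations could be computed like this (`retainedIterations` is a hypothetical helper):

```r
# Generic MCMC bookkeeping; an assumption about convention, not ddclone internals.
retainedIterations <- function(numOfIterations, burnIn, thinning) {
  if (burnIn >= numOfIterations) return(integer(0))
  seq(from = burnIn + 1, to = numOfIterations, by = thinning)
}

# Settings from the call above: drop 1 burn-in iteration, keep every draw.
kept <- retainedIterations(numOfIterations = 100, burnIn = 1, thinning = 1)
length(kept)

# A longer burn-in with thinning retains fewer, less correlated draws.
retainedIterations(numOfIterations = 100, burnIn = 50, thinning = 10)
```

The short settings used in this example keep the run fast; a real analysis would typically use more iterations and a longer burn-in.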

Display the result:

df <- ddCloneRes$df
expPath <- ddCloneRes$expPath

Evaluate against the gold standard:

nMut <- length($mutPrevalence)
goldStandard <- data.frame(mutID = 1:nMut,
                           clusterID = relabel.clusters(as.vector($mutPrevalence)),
                           phi = as.vector($mutPrevalence))

Note that the example data set is packaged together with its gold standard, which is what makes this evaluation possible.
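
The gold-standard cluster labels are derived from the true prevalences: mutations that share a prevalence are treated as belonging to the same clone. The behaviour of `relabel.clusters` is not documented here; the sketch below assumes it maps each distinct value to a consecutive integer label and uses a hypothetical stand-in, `labelsFromPrevalence`, on made-up prevalences.

```r
# Hypothetical true prevalences for six mutations (illustration only).
truePhi <- c(1.0, 1.0, 0.4, 0.4, 0.15, 1.0)

# Assign consecutive integer labels in order of first appearance.
# This is an assumed stand-in for relabel.clusters(), not its actual code.
labelsFromPrevalence <- function(phi) match(phi, unique(phi))

toyGoldStandard <- data.frame(mutID = seq_along(truePhi),
                              clusterID = labelsFromPrevalence(truePhi),
                              phi = truePhi)
toyGoldStandard
```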

Evaluate clustering:

(clustScore <- evaluate.clustering(goldStandard$clusterID, df$clusterID))
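
Cluster labels are arbitrary, so a clustering score has to be invariant to relabelling. The metrics reported by `evaluate.clustering` are not listed here; as a generic illustration of such a score (not the package's implementation), the Rand index measures the fraction of mutation pairs on which two clusterings agree. `randIndex` is a hypothetical helper and the labels are made up.

```r
randIndex <- function(trueLabels, predLabels) {
  stopifnot(length(trueLabels) == length(predLabels), length(trueLabels) >= 2)
  # For every pair of items, record whether the two items share a cluster.
  sameTrue <- outer(trueLabels, trueLabels, "==")
  samePred <- outer(predLabels, predLabels, "==")
  pairs <- upper.tri(sameTrue)   # visit each unordered pair once
  mean(sameTrue[pairs] == samePred[pairs])
}

trueLabels <- c(1, 1, 2, 2, 3)
sameScore  <- randIndex(trueLabels, c(2, 2, 3, 3, 1))   # same partition, renamed labels
movedScore <- randIndex(trueLabels, c(1, 1, 1, 2, 3))   # one mutation moved to another cluster
c(sameScore, movedScore)
```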

Evaluate prevalence estimates:

(phiScore <- mean(abs(goldStandard$phi - df$phi)))
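
The mean absolute error above subtracts the two `phi` columns row by row, so it is only meaningful when both data frames list the mutations in the same order. A minimal sketch of aligning by mutation ID first, on made-up values and assuming the result data frame carries a `mutID` column like the gold standard does:

```r
# Toy gold standard and estimates (hypothetical values); the rows of `est`
# are deliberately in a different order.
gold <- data.frame(mutID = 1:4, phi = c(1.0, 0.6, 0.6, 0.2))
est  <- data.frame(mutID = c(3, 1, 4, 2), phi = c(0.5, 0.9, 0.3, 0.6))

# Align estimates to the gold-standard order before subtracting.
estAligned <- est[match(gold$mutID, est$mutID), ]
stopifnot(!anyNA(estAligned$mutID))   # every gold-standard mutation was found

phiMeanError <- mean(abs(gold$phi - estAligned$phi))
phiMeanError
```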

Save the result:

score <- data.frame(clustScore, phiMeanError = phiScore)
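# Write comma-separated values so the contents match the .csv extension
write.table(score, file.path(expPath, 'result-scores.csv'), sep = ',', row.names = FALSE)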

2. Create a ddclone input object

ddClone’s input object is a list of three elements: mutCounts, psi, and filteredMutMatrix. Here we build one from the simulated data generated under the Generalized Dollo model:

# read.xlsx(file = , sheetName = ) matches the interface of the xlsx package (assumed here)
library(xlsx)

inputFilePath <- system.file("extdata", "inputs_simulated.xlsx", package = "ddclone")

Read the genotype-mutation matrix:

genDat <- read.xlsx(file = inputFilePath, sheetName = 'seed1_genotypes', row.names = TRUE)
genDatMutList <- colnames(genDat)

Read the bulk data:

bulkDat <- read.xlsx(file = inputFilePath, sheetName = 'seed_1_allele_counts', row.names = TRUE)
bulkMutList <- as.vector(bulkDat$mutation_id)
rownames(bulkDat) <- bulkMutList
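
The two mutation lists, `genDatMutList` and `bulkMutList`, make it easy to check that both inputs describe the same loci before building the input object. Whether `make.ddclone.input` performs such a check itself is not stated here; a minimal sketch of the comparison, on toy stand-ins with hypothetical mutation names, might look like this:

```r
# Toy stand-ins for genDat and bulkDat (hypothetical values).
toyGenDat <- data.frame(mutA = c(1, 0), mutB = c(1, 1), mutC = c(0, 1),
                        row.names = c("genotype1", "genotype2"))
toyBulkDat <- data.frame(mutation_id = c("mutB", "mutA", "mutD"),
                         stringsAsFactors = FALSE)

genMuts  <- colnames(toyGenDat)
bulkMuts <- as.vector(toyBulkDat$mutation_id)

sharedMuts <- intersect(genMuts, bulkMuts)   # present in both data types
onlyInGen  <- setdiff(genMuts, bulkMuts)     # genotyped but missing from bulk
onlyInBulk <- setdiff(bulkMuts, genMuts)     # in bulk but never genotyped

if (length(onlyInGen) > 0 || length(onlyInBulk) > 0) {
  message("Mutations not shared by both inputs: ",
          paste(c(onlyInGen, onlyInBulk), collapse = ", "))
}
```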

Generate the ddClone compatible data object:

ddCloneInputObj <- make.ddclone.input(bulkDat = bulkDat, genDat = genDat, outputPath = './output/dollo.0/', nameTag = '')

Inspect the data object:

str(ddCloneInputObj, max.level = 1)

Now we can run the analysis as in example 1 above:

ddCloneRes <- ddclone(dataObj = ddCloneInputObj,
              outputPath = './output/dollo.0/', tumourContent = 1.0,
              numOfIterations = 100, thinning = 1, burnIn = 1,
              seed = 1)


Frazier, P. I., and D. M. Blei. 2012. “Distance Dependent Chinese Restaurant Processes.” Journal of Machine Learning Research 12 (Aug): 2461–88.

Salehi, Sohrab, Adi Steif, Andrew Roth, Samuel Aparicio, Alexandre Bouchard, and Sohrab P. Shah. “Joint Statistical Inference of Clonal Populations from Single Cell and Bulk Tumour Sequencing Data.” (submitted).