Dr. David Jaffe, Broad Institute
High-quality draft assemblies of large and small genomes from massively parallel DNA sequence data
David B. Jaffe, Iain MacCallum, Sante Gnerre, Dariusz Przybylski, Filipe J. Ribeiro, Joshua N. Burton, Bruce J. Walker, Ted Sharpe, Giles Hall, Terrance P. Shea, Sean Sykes, Aaron M. Berlin, Daniel Aird, Maura Costello, Riza Daza, Louise Williams, Robert Nicol, Andreas Gnirke, Chad Nusbaum, Eric S. Lander
Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge MA 02142
Massively parallel sequencing (MPS) technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100 base) sequence reads at very low cost. While such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but 1000 times more expensive) capillary-based sequencing approach.
We report the development of a new algorithm for genome assembly, ALLPATHS-LG, and its application to MPS data from fifteen vertebrate genomes, generated on the Illumina platform: human, mouse (B6 and 129), bushbaby, ferret, shrew, ground squirrel, tenrec, stickleback, coelacanth, and five cichlid fish. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity and coverage of the genome. In particular, the base accuracy is high (>= 99.95%) and the scaffold sizes (e.g. N50 size = 11.5 Mb for human and 17.4 Mb for mouse) are similar to those obtained with capillary-based sequencing. The combination of new sequencing technology and new computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes.
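For readers unfamiliar with the N50 statistic cited above, a minimal sketch of how it is computed (in Python; the function name and example values are illustrative, not taken from the abstract):

def n50(lengths):
    """N50: the length L such that scaffolds (or contigs) of length
    >= L together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Illustrative example: half the total (45 kb) is covered once the
# 20 kb and 15 kb scaffolds are counted, so N50 = 15.
print(n50([20, 15, 10]))  # -> 15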
While high-quality assembly of large genomes remains a key challenge for the field, the assembly of small genomes is in fact often challenging as well, and is presently limited by defects in amplification-based MPS data, including limited read length and uneven coverage. Unamplified single-molecule sequencing data (having complementary properties) can now be generated on the Pacific Biosciences platform. At current yields, this is highly practical for small genomes, for which sample prep costs dominate. Using a modified version of ALLPATHS-LG, we demonstrate hybrid (Illumina plus Pacific Biosciences) assemblies of bacterial genomes. These assemblies are much better than the Illumina-only assemblies of the same genomes. In fact they close nearly all small gaps.
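The gap-closing idea behind such hybrid assemblies can be sketched as follows (a simplified illustration, not the ALLPATHS-LG implementation; real pipelines align long reads to the gap flanks and build a consensus rather than matching exactly):

def close_gap(left_flank, right_flank, long_reads):
    """Illustrative gap closing: find a long read that contains both
    flanks of a scaffold gap and return the sequence it places
    between them."""
    for read in long_reads:
        i = read.find(left_flank)
        if i == -1:
            continue
        j = read.find(right_flank, i + len(left_flank))
        if j != -1:
            return read[i + len(left_flank):j]  # candidate gap fill
    return None  # no spanning read found

# Hypothetical example: one PacBio-like read spans the gap.
print(close_gap("ACGT", "TTAA", ["GGACGTCCCGGTTAAG"]))  # -> "CCCGG"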
The ALLPATHS-LG program is available at http://www.broadinstitute.org/software/allpaths-lg/blog.
Dr. Florian Markowetz, University of Cambridge
Integrative analysis of breast cancer: Dissecting heterogeneity in samples and signals
I will talk about computational methods that address the heterogeneity of breast cancer at three different levels:
(i) At the *sample* level, we often find cancer cells mixed with immune cells, stromal cells and others. This mixture of cells leads to a mixture of signals when DNA, RNA, or proteins are measured in these samples. I will present an automated and quantitative approach to estimate cell mixtures and deconvolve molecular signals (a generic sketch of this idea follows the abstract).
(ii) On the level of *patients*, different data types (like copy number alterations and gene expression) can offer complementary perspectives on drivers of disease. When integrating these data to identify homogeneous subpopulations, it is important to distinguish cases where signals are concordant from cases where they are contradictory. I will describe how patient-specific data fusion based on the hierarchical Dirichlet process can reveal prognostic cancer subtypes.
(iii) On the *population* level, there exist distinct subtypes of breast cancer, and the genetic architecture differs between them. When inferring copy-number hotspots and regulatory networks, these subtypes have to be taken into account. In the last part of my talk, I will discuss how penalized regression can elucidate aberration hotspots mediating subtype-specific transcriptional responses in breast cancer.
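As background to item (i), a minimal sketch of reference-based mixture deconvolution (a generic formulation, not the specific method presented in the talk; all names and values are illustrative). Given reference expression profiles for the contaminating cell types, cell proportions can be estimated by non-negative least squares:

import numpy as np
from scipy.optimize import nnls

def estimate_mixture(observed, references):
    """Estimate cell-type proportions under the linear mixing model
    observed ~ references @ proportions, with proportions >= 0.
    'references' has one column per cell type (genes x types)."""
    weights, _ = nnls(references, observed)
    return weights / weights.sum()  # normalise to proportions

# Hypothetical 3-gene, 2-cell-type example: a 60/40 mixture.
refs = np.array([[10.0, 2.0],
                 [ 1.0, 8.0],
                 [ 5.0, 5.0]])
obs = 0.6 * refs[:, 0] + 0.4 * refs[:, 1]
print(estimate_mixture(obs, refs))  # -> approx [0.6, 0.4]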
Dr. Lior Pachter, University of California, Berkeley
Advances in transcript assembly and quantification using RNA-Seq
Adam Roberts, Harold Pimentel, Cole Trapnell, Lior Pachter
In the past year, RNA-Seq technology has provided new approaches to annotating genomes and estimating transcript abundances. We present new methods for quantification and guided assembly using a reference annotation. These methods have been incorporated into the Cufflinks RNA-Seq analysis program. Quantification is improved over previous methods with corrections for both sequence-specific and positional biases that arise in the library preparation steps. We have also added support for multi-reads that map to transcripts in distinct genomic locations. These updates allow for more accurate differential expression analyses with Cuffdiff, which now supports both biological and technical replicates. Finally, we introduce a novel approach to the visualization of RNA-Seq data via Color-My-Reads, which displays the probabilistic assignment of reads to transcripts. These new features can be applied to a variety of data types, including reads from both SOLiD and Illumina, and the program comes with presets for many popular library preparation protocols.
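For readers new to RNA-Seq quantification, the FPKM unit reported by Cufflinks can be sketched as follows (this is the standard definition; the function and example values are illustrative):

def fpkm(fragments, transcript_length_bp, total_fragments):
    """Fragments Per Kilobase of transcript per Million mapped
    fragments: normalises a raw fragment count for transcript
    length and overall sequencing depth."""
    kilobases = transcript_length_bp / 1000.0
    millions = total_fragments / 1.0e6
    return fragments / (kilobases * millions)

# Hypothetical example: 500 fragments on a 2 kb transcript,
# out of 10 million mapped fragments in total.
print(fpkm(500, 2000, 10_000_000))  # -> 25.0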
Cufflinks is available from http://cufflinks.cbcb.umd.edu/
The talk is available from http://dl.dropbox.com/u/3516588/RECOMB_Satellite_MPS_2011_Pachter.pdf
Invited Speakers
Dr. Chris Greenman, University of East Anglia and The Genome Analysis Centre, BBSRC
Rearrangement Phylogeny in Cancer
Next generation sequencing now allows a complete portfolio of cancer mutations to be constructed, including single nucleotide mutations, allelic copy number and rearrangements. Given these data several questions naturally emerge. Can we assemble the copy number segments into digital karyotypes? Can we identify the rearrangements that have taken place? Can we determine their order? Can we say when these events occurred? Does this inform us about the mutation processes or selection in any way? Whole genome sequences are being published in increasing numbers and such questions may help explain the etiology of cancer. In this talk the power and limits of a graph theoretic approach are explored.
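One common graph-theoretic encoding behind such questions (a generic sketch, not necessarily the formulation used in the talk): copy-number segments become nodes weighted by their copy number, rearrangement junctions become edges, and candidate karyotypes correspond to walks that use each segment its copy number of times. A minimal consistency check in Python, with hypothetical segments and junctions:

from collections import Counter

def consistent(walk, segments, junctions):
    """Sketch: check that a candidate karyotype (a walk over
    copy-number segments) uses each segment exactly its copy number
    of times and only steps along observed junctions. (Junction
    multiplicities are ignored in this simplification.)"""
    usage = Counter(walk)
    if any(usage[s] != c for s, c in segments.items()):
        return False
    allowed = set(junctions)
    return all(step in allowed for step in zip(walk, walk[1:]))

segments = {"A": 2, "B": 1, "C": 2}                 # segment -> copy number
junctions = [("C", "A"), ("A", "B"), ("B", "C")]    # observed adjacencies
print(consistent(["C", "A", "B", "C", "A"], segments, junctions))  # -> True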
Dr. Christina Leslie, Memorial Sloan Kettering Cancer Center
Inferring transcriptional and microRNA-mediated regulatory programs in glioblastoma
Large-scale cancer genome characterization projects, such as The Cancer Genome Atlas (TCGA) initiative, are currently generating multiple types of high-throughput molecular profiling data for large cohorts of tumors. Although these data sets provide multiple layers of genome-wide data for each tumor (e.g. DNA copy number, mRNA expression, and microRNA (miRNA) expression), many analyses examine each layer independently or combine the layers in somewhat generic or abstract ways. In this work, we present a statistical framework for inferring both common and subtype-specific transcriptional and microRNA-mediated dysregulation in cancer from multimodal tumor data sets.
We developed our approach to analyze a data set of 161 TCGA glioblastoma (GBM) tumor samples for which mRNA, microRNA, and copy number profiles were available. The TCGA project has categorized samples into three well-defined subtypes (proneural, classical, mesenchymal) based on a combination of expression phenotype and clinical information. To identify potential transcription factors (TFs) and microRNAs (miRNAs) involved in dysregulation in GBM, we first used motif analysis to identify putative TF binding sites in promoters (including alternative promoters) and miRNA binding sites in 3' UTRs. We then trained a sparse regression model to predict mRNA expression changes in each GBM sample relative to normal brain from the presence of these regulatory elements and gene copy numbers. Here, sparsity involves imposing a group lasso constraint so that the models are relatively uniform across samples of a subtype and fewer features (TFs and miRNAs) contribute to the regression model. This constraint avoids overfitting while also leading to more interpretable results. By cross-validation on held-out genes, we found that these models do indeed account for a significant part of the differential mRNA expression in GBM samples. Moreover, the group lasso approach gave a statistically significant improvement over sample-by-sample lasso models, pointing to the importance of sharing information across subtypes and across the tumor data set. Finally, we filtered miRNAs based on their differential expression relative to normal brain prior to training in order to restrict to more confident candidate regulators.
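A plausible form of the group-lasso objective described above, in notation of my own (the exact formulation in the work may differ): for each sample s, expression changes y_s are regressed on features X_s (regulatory elements and copy number), and the coefficients of each regulator g are penalized jointly across samples:

\min_{\{w_s\}} \; \sum_{s} \lVert y_s - X_s w_s \rVert_2^2 \;+\; \lambda \sum_{g} \sqrt{\sum_{s} w_{s,g}^2}

The group penalty zeroes out a regulator's coefficients across all samples at once, which is what keeps the per-sample models uniform within a subtype and restricts the number of contributing TFs and miRNAs.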
Statistical analysis of model parameters identified a number of TFs and miRNAs that are dysregulated in specific subtypes of GBM. Regulators represented across all subtypes are assigned as common regulators in GBM. We correctly identify known dysregulated miRNAs/TFs, such as upregulation of REST (a repressor of neuronal genes in non-neuronal cells) and downregulation of miR-124 (necessary for neuronal differentiation). Our collaborators in the Holland lab (MSKCC) are currently performing experimental validations on a novel set of microRNAs that the model predicts to be dysregulated in GBM. Our statistical framework provides a powerful tool for deriving mechanistic hypotheses about dysregulation of gene regulatory programs in cancer.
Dr. Quaid Morris, University of Toronto
Computational purification of tumour expression profiles
Tumour samples used for RNA expression profiling almost always contain a mixture of tumour and healthy, non-tumour tissue. Physical purification of these samples is difficult, expensive and time-consuming. As such, profiled tumour samples are rarely purified; instead samples are pre-selected for those with high apparent tumour content. In the few studies for which pathological estimates of tumour content are available, even these pre-selected samples are composed, on average, of 30% healthy tissue. To remove this contamination, we have developed ISOLATE-purify, a probabilistic inference algorithm that decomposes the profiles into its tumour and non-tumour components. Unlike similar algorithms, we can define distinct tumour profiles for each patient, thereby computationally purifying each sample profile. Our algorithms only require samples of expression profiles from healthy tissue contaminants. These profiles need not be from the same patient. Applying our algorithms to purify lung and prostate tumour samples allows us to define more accurate predictors of associated clinical outcomes and to reproduce pathologists estimates of sample tumour content.