
ENCODE Project

UCSC 2004

In April 2003, the sequence of the human genome was completed, but much remains to be done. To maximize the information contained in the sequence, the identity and precise location of all of the functional elements in the genome will have to be determined.

These include promoters and other transcriptional regulatory sequences, and determinants of chromosome structure and function such as origins of replication.

This project is assembling a comprehensive encyclopedia of all of these features in a selected 1% of the genome to better understand human biology, to predict potential disease risks, and to stimulate the development of new therapies to prevent and treat disease.

The NHGRI has created a highly interactive public research consortium to carry out a pilot project for testing and comparing existing and new methods to identify functional sequences in DNA.

The aim is to examine a diverse set of techniques, technologies and strategies to identify all the functional elements in defined regions of human genomic sequence, to identify gaps in our ability to annotate genomic sequence, and to consider the suitability of such methods to be scaled up for an effort to analyze the entire human genome.

There are several roles for the UCSC Genome Bioinformatics group in this work. We manage the official repository of the sequence-related data for the consortium and support the coordination of data submission, storage, retrieval, and visualization. We also have a special interest in comparative genomics, and are providing additional resources for the ENCODE groups working in this area.

We'd like to thank NHGRI for their support of this project, and the various contributors of annotations and analyses. The team at UCSC that develops and maintains this ENCODE site is made up of Daryl Thomas, Kate Rosenbloom, Jim Kent, and the UCSC Genome Bioinformatics staff.

News

24 June 2004 - ENCODE Project Portal Released

We are proud to announce the release of features in the UCSC Genome Browser that are tailored to the ENCODE project community, including this home page to consolidate these resources.

The initial resources include sequences for the current human assemblies (hg16, hg15, hg13, and hg12), sequence of the comparative species from NISC, tools for coordinate conversion between human assemblies, format descriptions for data submission, and contact information for help with submitting annotation data and analyses.
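Coordinate conversion between assemblies generally works by mapping a position through a table of aligned blocks shared by the two assemblies. The Python sketch below illustrates the idea only. It is not the UCSC conversion tool, and the `Block` table, its coordinates, and the `convert` function are hypothetical names and values made up for this example.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    old_start: int  # 0-based start of the block in the old assembly
    new_start: int  # 0-based start of the same block in the new assembly
    size: int       # length of the ungapped aligned block


# Hypothetical aligned blocks for one chromosome, sorted by old_start.
BLOCKS: List[Block] = [
    Block(0, 0, 1000),
    Block(1000, 1200, 500),  # 200 bases were inserted in the new assembly
    Block(1700, 1700, 300),  # old positions 1500-1699 have no counterpart
]


def convert(pos: int, blocks: List[Block] = BLOCKS) -> Optional[int]:
    """Map a 0-based position from the old assembly to the new one.

    Returns None when the position falls outside every aligned block.
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos >= block.old_start + block.size:
        return None  # pos lies in a gap between aligned blocks
    return block.new_start + (pos - block.old_start)


print(convert(1000))  # position at the start of the second block
print(convert(1500))  # position in an unaligned gap
```

A position that lands in a gap cannot be converted, which is why conversion tools typically report unmapped coordinates separately rather than guessing.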

Bulk downloads of the sequence and annotations may be obtained from the ENCODE Project Downloads page. The sequences available here are repeat-masked versions of the GenBank records.

The sequence and annotation data displayed in the Genome Browser are freely available for academic, nonprofit, and personal use, subject to the following conditions. The general Conditions of Use for the UCSC Genome Browser apply. The ENCODE-specific conditions of use are still being developed and will be displayed here in the future. Until then, please contact us for the conditions on the use of these data. 1)

History and Limitations

The ENCODE Project and the ENCODE Controversy

Stanford - Supplement to Genomics and Postgenomics - Copyright © 2016 by Stephan Guttinger (S.guttinger@exeter.ac.uk) and John Dupré (J.A.Dupre@exeter.ac.uk)

The ENCyclopedia Of DNA Elements (ENCODE) project was an international research effort funded by the National Human Genome Research Institute (NHGRI) that aimed to identify all functional elements (FE) in the human genome (ENCODE Project Consortium 2004). FEs include, for instance, protein-coding regions, regulatory elements such as promoters, silencers or enhancers and sequences that are important for chromosomal structure. The project, which began in 2003 and included 442 researchers during its main production phase, came to a conclusion in 2012 with the publication of 30 different papers in different journals (ENCODE Project Consortium 2012; Pennisi 2012). Similarly to the HapMap project, ENCODE was presented as the logical next step after the sequencing of the genomic DNA, since tackling the interpretation of the sequences was now seen as the top priority (ENCODE Project Consortium 2004).

The ENCODE project sparked a heated debate in academic journals, the blogosphere and also in the national and international press. The claim that drew the most ire was the project’s conclusion that 80.4% of the human genomic DNA has a ‘biochemical function’ (ENCODE Project Consortium 2012). To understand the strong reaction this statement provoked we have to turn our focus again to the C-value paradox and the concept of ‘junk DNA’ (see Section 2.3 of the main text).

In the context of the ENCODE controversy this debate was linked with the issue of how to define a ‘functional element’ and how scientists ascribe functions in biological systems. What the ENCODE research implied, at least in the eyes of some commentators, was that the idea of junk DNA was proven wrong, because almost all of our DNA turned out to be functional. This led to claims that textbooks will have to be re-written, as they still describe the genome as mainly composed of junk.[S1] The defenders of the old view claimed that the ENCODE researchers set far too low a bar in ascribing functions to elements of biological systems.

The Methodology of the ENCODE Project

The ENCODE project used a range of different experimental assays to analyse what they referred to as ‘sites of biochemical activity’ (for an overview of the ENCODE output see Qu & Fang 2013). These are sites at which some sort of modification can be identified (for instance methylation) or to which an activity (such as transcription of DNA to RNA) can be ascribed. These modifications or activities were taken as strong indications that the identified regions of the genomic DNA play a functional role in human cells.

As an example of how this approach worked, ENCODE researchers were interested in finding out how much of the genomic DNA is involved in the regulation of gene expression. Researchers postulate that a key hallmark of all regulatory DNA elements is their accessibility. This makes sense as the regulatory and transcriptional machinery need access to these DNA sites. ENCODE used this feature of regulatory DNA to map (putative) regulatory elements in the human genome.

One way to do so is to perform what is called a ‘DNase I hypersensitivity assay’. DNase I is a protein that can cut DNA and this cutting process works better when the template DNA is accessible, meaning that highly accessible regions are more sensitive to DNase I activity. The behaviour of the genome in the DNase I hypersensitivity assay can therefore be used to learn indirectly about its structure, from which researchers then infer the presence of a functional element (in this case a regulatory sequence). This is just one example of about 24 different types of assays that ENCODE researchers used to get a better insight into the number and distribution of functional elements in the human genome (for a discussion of the different types of experimental approaches used in ENCODE see Kellis et al. 2014).
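The inference step in the DNase I assay can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not ENCODE's actual peak-calling procedure: the per-base cut counts are made up, and the `window` and `fold` parameters are hypothetical values chosen only to make the example readable.

```python
from typing import List, Tuple


def hypersensitive_windows(
    cut_counts: List[int],
    window: int = 5,
    fold: float = 3.0,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows whose mean cut count is at
    least `fold` times the mean over the whole region.

    `window` and `fold` are illustrative parameters, not ENCODE settings.
    """
    if not cut_counts:
        return []
    background = sum(cut_counts) / len(cut_counts)
    if background == 0:
        return []  # no cuts at all, so nothing can stand out
    hits = []
    # Step through non-overlapping windows of fixed size.
    for start in range(0, len(cut_counts) - window + 1, window):
        mean = sum(cut_counts[start:start + window]) / window
        if mean >= fold * background:
            hits.append((start, start + window))
    return hits


# Made-up DNase I cut counts per base along a 20-base region.
counts = [0, 1, 0, 0, 1, 9, 12, 10, 11, 8, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(hypersensitive_windows(counts))  # [(5, 10)]
```

The code reports only that a window is unusually accessible. Calling that window a regulatory element is a further inferential step that the code does not and cannot make.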

What is interesting about most of these assays is that they look at a proxy for function: if a stretch of DNA is hypersensitive to DNase I then it is automatically defined as functional. Another example is DNA transcription itself. If a DNA sequence shows up in RNA sequencing then this means it has been transcribed into RNA by the enzyme RNA polymerase. This activity, in the eyes of the ENCODE researchers at least, makes the DNA element in question a functional element of the genome.
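Under this proxy-based definition, a genome-wide percentage is in effect the fraction of bases covered by at least one assay-positive interval. The Python sketch below shows that calculation on a toy scale. The interval lists and the 100-base "genome" are hypothetical, and the function names are chosen for this example only.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(intervals: Iterable[Interval], genome_length: int) -> float:
    """Fraction of the genome covered by the union of all intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical assay calls on a 100-base toy genome.
dnase_sites = [(10, 30), (60, 70)]
transcribed = [(20, 50), (65, 90)]
print(covered_fraction(dnase_sites + transcribed, 100))  # 0.7
```

Because the union can only grow as assays are added, every additional proxy raises the covered fraction or leaves it unchanged, regardless of whether the newly covered bases matter to the cell.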

But such a broad approach to finding out about functional elements is highly problematic, as a transcription event or hypersensitivity can be present for many different reasons (for instance as a result of transcriptional noise). This is exactly what some critics of ENCODE homed in on, pointing out that merely showing the existence of a structure (such as methylation) or a process (such as transcription) is not enough by itself to prove any functional significance of these biochemical features (Doolittle 2013; Eddy 2012; Graur et al. 2013; Niu & Jiang 2013).

Whilst this is surely a valid point that applies to a large part of the research done within ENCODE, not all studies performed as part of the project looked at such proxies. An example is Whitfield et al. (2012), who did not just look at specific modifications or behaviour of DNA in particular assays but mutated specific sites to check whether interfering with these sites has an effect on gene expression. 2)

NIH - ENCODE Project Common Cell Types

The Encyclopedia of DNA Elements (ENCODE) Project seeks to identify functional elements in the human genome. To aid in the integration and comparison of data produced using different technologies and platforms, the ENCODE Consortium has designated cell types that will be used by all investigators. These common cell types include both cell lines and primary cell types, and plans are being made to explore the use of primary tissues and embryonic stem (ES) cells.

Cell types were selected largely for practical reasons, including their wide availability, the ability to grow them easily, and their capacity to produce sufficient numbers of cells for use in all technologies being used by ENCODE investigators. Secondary considerations were the diversity in tissue source of the cells, germ layer lineage representation, the availability of existing data generated using the cell type, and coordination with other ongoing projects. Effort was also made to select at least some cell types that have a relatively normal karyotype.

The cell types and rationales for their selection are described below:

Tier 1:

  • GM12878 is a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation. It was one of the original HapMap cell lines and has been selected by the International HapMap Project for deep sequencing using the Solexa/Illumina platform. This cell line has a relatively normal karyotype and grows well. Choice of this cell line offers potential synergy with the International HapMap Project and genetic variation studies. It represents the mesoderm cell lineage. Cells will be obtained from the Coriell Institute for Medical Research [coriell.org] (Catalog ID GM12878).
  • K562 is an immortalized cell line produced from a female patient with chronic myelogenous leukemia (CML). It is a widely used model for cell biology, biochemistry, and erythropoiesis. It grows well, is transfectable, and represents the mesoderm lineage. Cells will be obtained from the American Type Culture Collection (ATCC) [atcc.org] (ATCC Number CCL-243).
  • H1 human embryonic stem cells will be obtained from Cellular Dynamics International [cellulardynamics.com].

Tier 2:

  • HeLa-S3 is an immortalized cell line that was derived from a cervical cancer patient. It grows extremely well in suspension and is transfectable. It represents the ectoderm lineage. Many data sets were produced using this cell line during the pilot phase of the ENCODE Project. In addition, these cells have been widely used in biochemical and molecular genetic studies of gene function and regulation. Cells will be obtained from the American Type Culture Collection (ATCC) [atcc.org] (ATCC Number CCL-2.2).
  • HepG2 is a cell line derived from a male patient with liver carcinoma. It is a model system for metabolism disorders and much data on transcriptional regulation have been generated using this cell line. It grows well, is transfectable, and represents the endoderm lineage. Cells will be obtained from the American Type Culture Collection (ATCC) [atcc.org] (ATCC Number HB-8065).
  • HUVEC (human umbilical vein endothelial cells) have a normal karyotype and are readily expandable to 10^8-10^9 cells. They represent the mesoderm lineage. Cells will be obtained from Lonza Biosciences [lonza.com].

Tier 2.5:

  • SK-N-SH
  • IMR90 (ATCC CCL-186)
  • A549 (ATCC CCL-185)
  • MCF7 (ATCC HTB-22)
  • HMEC or LHCM
  • CD14+
  • CD20+
  • Primary heart or liver cells
  • Differentiated H1 cells
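The Tier 1 and Tier 2 descriptions above can be collected into a small lookup table, which makes the germ layer representation easy to check. The Python sketch below records only details stated in the text. The `CellType` record and the `by_lineage` helper are hypothetical names introduced for this example, and Tier 2.5 entries are omitted because the text gives no lineage for them.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class CellType(NamedTuple):
    name: str
    tier: str
    lineage: Optional[str]  # None where the text above states no lineage
    source: str


CELL_TYPES: List[CellType] = [
    CellType("GM12878", "1", "mesoderm", "Coriell Institute for Medical Research"),
    CellType("K562", "1", "mesoderm", "ATCC"),
    CellType("H1", "1", None, "Cellular Dynamics International"),
    CellType("HeLa-S3", "2", "ectoderm", "ATCC"),
    CellType("HepG2", "2", "endoderm", "ATCC"),
    CellType("HUVEC", "2", "mesoderm", "Lonza Biosciences"),
]


def by_lineage(cell_types: List[CellType]) -> Dict[Optional[str], List[str]]:
    """Group cell type names by the germ layer lineage they represent."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for cell_type in cell_types:
        groups[cell_type.lineage].append(cell_type.name)
    return dict(groups)


for lineage, names in by_lineage(CELL_TYPES).items():
    print(lineage, names)
```

Grouping this way shows that all three germ layers are represented across Tiers 1 and 2, with mesoderm covered by three of the six cell types.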

Additional information about the ENCODE common cell types, including cell growth protocols being used by ENCODE investigators, is available at http://genome.ucsc.edu/ENCODE/cellTypes.html.

Information on Common Resources used by the ENCODE pilot project, including the pilot project target sequences, BAC Clones for ENCODE targets, cell lines and antibodies to DNA-binding proteins can be found at www.genome.gov/12513455/encode-pilot-project-common-consortium-resources/. Last Updated: March 9, 2012 3)

NATURE 2014 Progress Review

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification.

These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation.

The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research. 4)
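A statistical correspondence between annotated elements and disease-linked variants can be illustrated as a simple fold enrichment: the number of variants that fall inside elements, divided by the number expected if variants were placed uniformly along the genome. The Python sketch below assumes that simplified null model. The element coordinates, variant positions, and function names are hypothetical, and a real analysis would also need a significance test and a better-matched background.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); list sorted, non-overlapping


def count_inside(positions: List[int], elements: List[Interval]) -> int:
    """Count positions that fall inside any element."""
    starts = [start for start, _ in elements]
    hits = 0
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # last element starting at or before pos
        if i >= 0 and pos < elements[i][1]:
            hits += 1
    return hits


def fold_enrichment(
    positions: List[int], elements: List[Interval], genome_length: int
) -> float:
    """Observed hits divided by hits expected under uniform placement."""
    covered = sum(end - start for start, end in elements)
    expected = len(positions) * covered / genome_length
    if expected == 0:
        raise ValueError("no variants or no covered bases; enrichment undefined")
    return count_inside(positions, elements) / expected


# Toy data: elements cover 200 of 1000 bases; six made-up variant positions.
elements = [(100, 200), (500, 600)]
variants = [120, 150, 199, 200, 550, 900]
print(count_inside(variants, elements))
print(fold_enrichment(variants, elements, 1000))
```

A value above 1 means more variants land in elements than uniform placement would predict. On its own it says nothing about whether the overlap is causal.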

=== Transcribed and protein-coding regions === 5)

Cancer Genes

6)

Project ENCODE Users Guide

A User's Guide to the Encyclopedia of DNA Elements (ENCODE)

The ENCODE Project Consortium - PLOS - Published: April 19, 2011

DOI: 10.1371/journal.pbio.1001046

Abstract

The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns.

In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

Author Summary

The Encyclopedia of DNA Elements (ENCODE) Project was created to enable the scientific and medical communities to interpret the human genome sequence and to use it to understand human biology and improve health.

The ENCODE Consortium, a large group of scientists from around the world, uses a variety of experimental methods to identify and describe the regions of the 3 billion base-pair human genome that are important for function. Using experimental, computational, and statistical analyses, we aimed to discover and describe genes, transcripts, and transcriptional regulatory regions, as well as DNA binding proteins that interact with regulatory regions in the genome, including transcription factors, different versions of histones and other markers, and DNA methylation patterns that define states of the genome in various cell types.

The ENCODE Project has developed standards for each experiment type to ensure high-quality, reproducible data and novel algorithms to facilitate analysis. All data and derived results are made available through a freely accessible database. This article provides an overview of the complete project and the resources it is generating, as well as examples to illustrate the application of ENCODE data as a user's guide to facilitate the interpretation of the human genome. 7)
