Student Wiki on methodology

This Wiki is intended to collectively summarize the methodologies employed in the research papers we analyze during the course. "Writers" are students who wish to contribute to a specific subject. Before contributing, please add your name to the "Writers group choice". When starting a contribution, please indicate your name in brackets.

PLEASE:  DO NOT change the INDEX page !!!
This page contains the links to the nine official subjects, which are the same in the Choice.

To contribute, go to the correct page by clicking on the description here in the index, then click EDIT and contribute. At the end, please save.


Please do not cut and paste extensively: it is useless, since anybody can go to the source you use and read it. Read the texts, digest them, and write a short summary. If you wish, you can include link(s) to the source(s).

Other contributors can revise, add, erase, modify... Please also avoid repeating text that is already there.

Transcriptome: special techniques, RNA-Seq, GRO-Seq, CAGE, others.




(Author: Ilaria Ferrarotto)

Transcriptome analysis is the study of the transcriptome, that is, the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell, using high-throughput methods.

Transcriptome analysis by next-generation sequencing (RNA-seq) allows investigation of a transcriptome at unsurpassed resolution, detecting both coding and regulatory transcripts such as siRNA and lncRNA. One major benefit is that RNA-seq does not depend on a priori knowledge of the sequence under investigation, so it also allows analysis of poorly characterized species.

Brief outline of the workflow:

  1. Bulk RNA is extracted from the sample and the desired RNA is selected (sample preparation).
  2. The selected RNA is copied into stable double-stranded copy DNA (library construction).
  3. The ds cDNA is then sequenced using one of various sequencing methods.
  4. The sequences obtained are aligned to reference genome sequences, available in data banks, to identify which genes are transcribed. This type of analysis provides a quantification of the expression levels of the transcribed genes. Alternatively, RNA-seq can be used to identify alternative splicing, novel transcripts, and fusion genes, following a new transcript discovery approach.
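The quantification mentioned in step 4 can be illustrated with a minimal sketch. It assumes that reads have already been aligned and counted per gene, and it applies counts per million (CPM), which is only one simple normalization among several. The function name, gene names and counts are made up for illustration; real analyses rely on dedicated tools.

```python
def counts_per_million(gene_counts):
    """Scale raw per-gene read counts so that the sample sums to one million."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("the sample contains no mapped reads")
    return {gene: count * 1_000_000 / total for gene, count in gene_counts.items()}


# Hypothetical numbers of reads aligned to three genes in one sample
example_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
cpm = counts_per_million(example_counts)

for gene, value in cpm.items():
    print(gene, value)
```

Dividing by the total number of mapped reads makes samples sequenced at different depths comparable, which is why some form of normalization usually precedes any comparison of expression levels.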

The complete workflow of RNA-seq consists of: (1) experimental design; (2) sample and library preparation; (3) sequencing; and (4) data analysis. You will find a general explanation of each step in the following video.

For a deeper understanding of the RNA-seq technology and its applications follow these links:


Global Run-On sequencing (GRO-seq) couples the Nuclear Run-On assay, introduced over 40 years ago, to deep sequencing, making it a high-throughput evolution of that assay.

The advantages of this protocol are its exceptional sensitivity and the possibility to map nascent transcripts at genome-wide scale, providing a reliable, unbiased, real-time measure of transcriptional activity from engaged RNA polymerase in mammalian cells. By contrast, the steady-state level of RNA measured by conventional sequencing methods does not accurately mirror transcriptional activity per se.

Moreover, it delivers a high-resolution map of coding and noncoding transcripts that is especially useful for annotating and quantifying short-lived RNA molecules. These are usually hard to detect because, owing to their instability, they do not accumulate in the nucleus and elude most RNA detection protocols.

For example, this method has recently been used to characterize enhancer-associated RNAs (eRNAs) and their transcription in response to stimuli such as estrogen, LPS and Epidermal Growth Factor. It has also yielded crucial information on RNA polymerase II (RNAPII), such as its density at different classes of protein-coding genes, defects in elongation, pause-release and termination, and its capacity to fire bidirectionally at most mammalian promoters, initiating noncoding RNAs that are transcribed antisense with respect to the messenger RNA.
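As a rough illustration of how RNAPII density and pausing can be summarized from such data, the sketch below compares read density near the transcription start site with read density over the rest of the gene. This ratio is often called a pausing index; it is not taken from the protocol described here, and the 250 bp window, the coordinates, the read positions and the function names are all assumptions chosen for the example (a gene on the plus strand is assumed).

```python
def read_density(read_positions, start, end):
    """Reads per base in the half-open interval [start, end)."""
    if end <= start:
        raise ValueError("interval must have positive length")
    n_reads = sum(1 for pos in read_positions if start <= pos < end)
    return n_reads / (end - start)


def pausing_index(read_positions, tss, gene_end, proximal_window=250):
    """Ratio of promoter-proximal read density to gene-body read density.

    proximal_window is an assumed window size, not a value from the source.
    """
    proximal = read_density(read_positions, tss, tss + proximal_window)
    body = read_density(read_positions, tss + proximal_window, gene_end)
    if body == 0:
        return float("inf") if proximal > 0 else 0.0
    return proximal / body


# Hypothetical gene spanning 1000-3000 with made-up read start positions
proximal_reads = [1000 + i * 5 for i in range(50)]   # 50 reads near the TSS
body_reads = [1250 + i * 25 for i in range(70)]      # 70 reads in the gene body
index = pausing_index(proximal_reads + body_reads, tss=1000, gene_end=3000)
print(round(index, 2))
```

A ratio well above 1 suggests that polymerase accumulates near the promoter relative to the gene body, while a ratio near 1 suggests no marked accumulation.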

Limitations: the technique is laborious and requires a large amount of starting material (the number of cells required lies in the 10^7 range).


  • Nuclei isolation: nuclei from mammalian cells are isolated, washed to remove free nucleotides and kept ice-cold to arrest ongoing transcription;
  • Nuclear Run-On: transcription is resumed in vitro when nuclei are incubated at 30°C in the presence of brominated nucleotides and the anionic detergent sarkosyl, which prevents de novo assembly of the pre-initiation complex and thus avoids re-initiation;
  • Elongation: transcripts that were already initiated at the time of nuclei isolation (commonly referred to as nascent RNA) are further elongated by engaged RNA polymerase, which allows incorporation of the brominated nucleotides;
  • First immunoprecipitation: affinity purification of the labeled RNA with commonly used antibodies against bromodeoxyuridine (anti-BrdU);
  • End repair;
  • Second immunoprecipitation;
  • Adapter ligation;
  • Third immunoprecipitation;
  • Library preparation: the isolated nascent RNA is finally converted into an Illumina-compatible DNA library suitable for deep sequencing.


  • GRO-seq, A Tool for Identification of Transcripts Regulating Gene Expression, March 2017, Methods in Molecular Biology 1543:45-55, DOI: 10.1007/978-1-4939-6716-2_3
(GRO-Seq written by Fabiola Campestre)

CAGE-seq

The beginning

As sequencing moved from Sanger to next-generation platforms, the refinement of CAGE technology went hand in hand with the development of sequencing technology, which gave us the power to characterize RNA better than before.


CAGE stands for Cap Analysis of Gene Expression: it analyzes the 5' cap of mRNA and, beyond that, helps to identify and quantify transcription start sites (TSSs), so that promoters are characterized at single-nucleotide resolution. CAGE maps the initiation sites of both capped coding and capped noncoding RNAs. It can also help to identify novel regulatory elements and to predict transcription factor binding sites and motifs associated with transcription.

In eukaryotes, the analysis of 5’ ends by CAGE is suitable for inferring gene regulatory networks, and it has provided knowledge of the key transcription factors responsible for cell differentiation, for instance of monoblasts into monocytes (Suzuki H, et al.).

Deeper sequencing is necessary to detect all active promoters in a given tissue, for instance in mammalian cells, since they have at least 5–10 times more TSSs. CAGE was used to discover promoter activity from small subpopulations of hippocampal cells (Valen, Eivind, 2009).

The ENCODE project at NIH is one of the most important databases that use this technique.

The picture below shows the general workflow of CAGE.

[Figure: workflow of CAGE-seq, which lets us sequence both coding and noncoding RNA.]

CAGE also allows the operator to observe that retrotransposon elements are specifically expressed and act as regulators of protein-coding RNAs and other ncRNAs.

How it works

CAGE uses a “cap-trapping” technology based on the biotinylation of the 7-methylguanosine cap of Pol II transcripts to pull down the 5’-complete cDNAs reverse transcribed from the captured transcripts. Through massive sequencing of the 5’ ends of the cDNAs and analysis of the sequenced tags, transcription start sites and transcript abundance are inferred on a genome-wide scale.

CAGE library preparation

The main steps of CAGE are:

  1. Reverse transcription with a random primer mixture to make cDNA
  2. Biotinylation: the cap is oxidized and then labeled with biotin hydrazide
  3. Digestion of single-stranded RNA with RNase I
  4. Capture of the biotinylated fragments on magnetic beads
  5. Washing away of unbound material
  6. Release of the cDNA from the mRNA by denaturation
  7. Single-strand linker ligation, in which the adaptor carries a barcode, at the 3' end of the cDNA
  8. Single-strand linker ligation at the 5' end of the cDNA
  9. Second-strand synthesis with a longer linker primer
  10. Loading on the instrument and sequencing
Analyzing data

The primary output of CAGE is a set of sequences, each of which is a short read corresponding to the 5’ end of a capped RNA molecule, also called a CAGE tag. Computational processing then follows, which yields the mapping of each tag (its genomic location), the clustering of tags into units of transcriptional initiation on the genome, and the tag activity, or expression level.
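The clustering step described above can be sketched in a few lines. This is a simplified, distance-based grouping under stated assumptions, not the procedure used by any particular CAGE pipeline: tags are assumed to be already mapped to one strand of one chromosome, the 20 bp gap threshold is arbitrary, and the function names and positions are invented for the example.

```python
from collections import Counter


def summarise(positions, per_position):
    """Return (start, end, tag_count, dominant_position) for one cluster."""
    total = sum(per_position[p] for p in positions)
    dominant = max(positions, key=lambda p: per_position[p])
    return (positions[0], positions[-1], total, dominant)


def cluster_tags(tag_starts, max_gap=20):
    """Group mapped CAGE tag 5' positions into clusters.

    Neighbouring positions separated by more than max_gap (an assumed
    threshold) start a new cluster.
    """
    per_position = Counter(tag_starts)
    clusters = []
    current = []
    for pos in sorted(per_position):
        if current and pos - current[-1] > max_gap:
            clusters.append(summarise(current, per_position))
            current = []
        current.append(pos)
    if current:
        clusters.append(summarise(current, per_position))
    return clusters


# Hypothetical 5' positions of eleven mapped tags
example_tags = [100, 100, 100, 102, 105, 105, 400, 401, 401, 401, 401]
clusters = cluster_tags(example_tags)
for cluster in clusters:
    print(cluster)
```

Each cluster stands for one unit of transcriptional initiation: its tag count serves as a crude expression level, and its most frequently used position is a candidate dominant TSS.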


Advantages:

  • Measures RNA expression levels
  • Maps TSSs in promoter regions at single-nucleotide resolution
  • Discovers alternative promoters

Limitations:

  • Only works on total mature RNA
  • CAGE selectively removes non-capped RNAs
  • CAGE is not applicable to prokaryotes or to RNAs shorter than 100 nt

(written by Dante Davide)


Hazuki Takahashi, Timo Lassmann, Mitsuyoshi Murata and Piero Carninci, 5’ end-centered expression profiling using Cap-analysis gene expression (CAGE) and next-generation sequencing, Nature Protocols, 2012, 542–561.

Eivind Valen et al., Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE, Genome Research, 2009, 255–265.

Rimantas Kodzius et al., CAGE: cap analysis of gene expression, Nature Methods, 2006, 211–222.

SAGE-seq

The beginning
SAGE was first described in 1995; the method was originally developed to investigate genes that might be differentially expressed in colon cancer.
With the advent of the human genome project, a vast amount of information about genes and gene structure is suddenly at our fingertips. But this information is limited. Every cell within an organism has the same genetic composition, with the exception of gametes, and yet skin tissue is obviously very different from nervous tissue.
In this way, a given DNA sequence only provides information about what could be, not what actually is.
Serial analysis of gene expression (SAGE) is a powerful genome-wide gene-expression profiling approach utilized for the characterization of transcriptomes.
In short, SAGE is a technique that allows rapid, detailed analysis of thousands of transcripts in a cell.
The basic concept of SAGE rests on two principles:
  1. A small sequence of nucleotides from the transcript, called a ‘tag’, can effectively identify the original transcript from whence it came.
  2. Linking these tags allows for rapid sequencing analysis of multiple transcripts.
How SAGE works
The principle is to isolate unique sequence tags (9–10 bp in length) from individual mRNAs and to link the tags serially into long DNA molecules for lump-sum sequencing. The method works by isolating short fragments of genetic information from the expressed genes that are present in the cell being studied. These short sequences, called SAGE tags, are linked together for efficient sequencing. The frequency of each SAGE tag in the cloned multimers directly reflects the abundance of the corresponding transcript. Comparing the profiles constructed for a pair of cells kept under different conditions also helps to identify the set of genes specific to each cellular condition. SAGE cannot identify mRNA 5′ ends, which may lie tens or hundreds of kilobases upstream in the genomic sequence. Consequently, SAGE is suitable only for obtaining 3′ end sequence information for counting transcriptional units (TUs).
Today, SAGE-based databases, for example of yeast and cancer transcriptomes, are already accessible via the internet.
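The idea that a short tag identifies a transcript and that tag frequency reflects abundance can be sketched as follows. The sketch assumes that the tag is the 10 bp immediately downstream of the 3'-most CATG site (the recognition site of NlaIII, used here as the anchoring enzyme); the sequences, function names and constants are invented for illustration and do not reproduce an actual SAGE analysis pipeline.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme recognition site (NlaIII)
TAG_LENGTH = 10       # assumed tag length in bp


def extract_tag(transcript, site=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag that follows the 3'-most anchoring site, or None."""
    index = transcript.rfind(site)
    if index == -1:
        return None  # transcript has no anchoring site, so it yields no tag
    start = index + len(site)
    tag = transcript[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(transcripts):
    """Count how often each tag occurs in a pool of transcripts."""
    tags = (extract_tag(t) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)


# Made-up transcript pool: three copies of one mRNA, one of another,
# and one sequence without any CATG site
abundant = "AAACATGGGTTTCATGACGTACGTACAAAA"
rare = "TTTTCATGTTGGCCAATTGGAAA"
no_site = "GGGGCCCC"
tag_counts = count_tags([abundant, abundant, abundant, rare, no_site])
print(tag_counts)
```

The transcript without a CATG site produces no tag at all, which illustrates why a tag-based method can miss some transcripts entirely.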
Main steps of SAGE
The figure below shows a schematic diagram of each step in SAGE.
The different steps can be broken down as follows:

  1. Isolate RNA: RNA extraction is a critical first step; a complementary DNA strand, or cDNA, of each transcript in the cell must then be generated. This is necessary because mRNA is much less stable than DNA.
  2. Perform PCR
  3. Perform sequencing reaction
  4. Purify the sequencing reaction: it is important to remove unincorporated dye terminators and salts that may compete
  5. Perform capillary electrophoresis
  6. Analyze data

Analyzing data
A typical panel of results is shown below.
[Figure: typical results of SAGE]

Advantages:

  • The mRNA sequence does not need to be known in advance, so genes or variants that are not yet known can be discovered.
  • It is more accurate, as it involves direct counting of the number of transcripts.

Limitations:

  • The gene tag is extremely short, about 13 bp, so if the tag is derived from an unknown gene, it is difficult to analyze with such a short sequence.
  • Type II restriction enzymes do not yield fragments of the same length.
  • mRNA levels and protein expression are not always correlated.
In 2002 a novel SAGE technique called Long-SAGE was created as a more robust version of the original. It had a higher throughput. First, a small amount of mRNA (50 ng) was enough for library construction. Second, cDNA adapter and ditag formation was enhanced through an extended (overnight) ligation period. Third, only 20 ditag polymerase chain reactions were needed to obtain a complete library. Fourth, cloning efficiency was greatly improved thanks to a new endonuclease, NlaIII.
Long-SAGE releases 21 bp from each transcript by using a different enzyme. These longer tags are therefore much more efficient than conventional SAGE tags for identifying novel genes in complex genomes.
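A quick calculation shows why longer tags are more specific. Taking the tag lengths quoted in this section at face value (10, 13 and 21 bp) and assuming that each position can be any of the four bases, the number of distinct possible tags grows as 4 to the power of the tag length. The function name is just for illustration.

```python
def possible_tags(length):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** length


for length in (10, 13, 21):
    print(f"{length} bp: {possible_tags(length):,} possible tags")
```

With far more possible sequences than there are transcripts, a 21 bp tag is much less likely to match several genomic locations by chance than a short conventional tag.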

(written by Dante Davide)
Min H. and Kornelia Polyak, Serial analysis of gene expression, Nature Protocols, 2006, pages 1743–1760