CAGE-seq

Warning

under construction

Integration with FANTOM5 data

From a user view point it will be more convenient when CAGE-seq data will be integrated one with other and with FANTOM5 data.

Main ideas are following:

  • to build merged set of TSSs from FANTOM5 and obtained data;

  • when TSSs are overlapping in FANTOM5 and obtained data, then maximally preserve FANTOM5 TSSs and in rare cases add new TSSs. This is the most sophisticated part. See its description below.

  • using the joined set of TSSs to build the joined table of TSSs expressions using FANTOM5 and new data. For this purpose we are recalculating TSSs expressions using their joined set. Due to the suggested approach all FANTOM5 data will be preserved and seamlessly extended by the new data.

  • annotate new TSSs using the same approach as FANTOM5. All FANTOM5 data also will be preserved and seamlessly extended by the new data.

  • while FANTOM5 does not provide enhancers and their annotation for CAGE-seq data for rat, we are building a set of enhancers using both FANTOM5 and our data. to provide a unique ID for each TSS, enhancer and promoter so a user can unambiguously refer to them.

The suggested approach can be used for incremental integration of other CAGE-seq data with FANTOM5 data.

Organisms

The GTRD database and FANTOM5 contains corresponding data for different organisms.

The table below shows species in the GTRD database and FANTOM5.

Currently we include in the GTRD database CAGE-seq from FANTOM5 for 3 common species only: Homo sapiens, Mus musculus and Rattus norvegicus.

Species

GTRD

FANTOM5

Comment

Arabidopsis thaliana

Caenorhabditis elegans, nematode

Danio rerio, zebrafish

Drosophila melanogaster, fruit fly

Homo sapiens, human

Mus musculus, mouse

Rattus norvegicus, rat

Saccharomyces cerevisiae, budding yeast

Schizosaccharomyces pombe, fission yeast

Canis familiaris, dog

possibly will be added in GTRD next releases

Gallus gallus, chicken

possibly will be added in GTRD next releases

Macaca mulatta, rhesus macaque

possibly will be added in GTRD next releases

FANTOM5 data import

Table below specifies FANTOM5 files that were imported into the GTRD database.

Species

FANTOM5 directory, files

Comments

Homo sapiens

https://fantom.gsc.riken.jp/5/datafiles/reprocessed/hg38_v8/extra/

Mus musculus

https://fantom.gsc.riken.jp/5/datafiles/reprocessed/mm10_v8/extra/

Rattus norvegicus

https://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks

CAGE peaks annotation

If CAGE peak is located near with known gene TSS then it is annotated as promoter.

Also such peak is annotated with hyperlinks to corresponding transript/gene/protein identifier in the external database.

For each specie its own set of databases is used as shown in table below.

Column comments:
  • Database, entity - database name and correspondin entity type (transript/gene/protein)

  • FANTOM5 - indicates whether references to this database are used in annotation by FANTOM5

  • GTRD, FANTOM5 - indicates whether references to this database are used in GTRD for CAGE peaks imported from FANTOM5

  • GTRD, new peaks - indicates whether references to this database are used for annotation of new CAGE peaks in GTRD

Species

Database, entity

FANTOM5

GTRD, FANTOM5

GTRD, new peaks

Description, comment

Homo sapiens

Ensembl, transcript id

Ensembl, gene id

duplicates cage_promoters2transcripts?

Entrez, gene id

HGNC, gene id

FANTOM5, peak id

duplicates cage_peaks_xx.fantom5_id?

Associated gene name Associated gene symbol Associated gene synonym Source of associated gene

Mus musculus

Ensembl, transcript id

Ensembl, gene id

duplicates cage_promoters2transcripts?

Entrez, gene id

FANTOM5, peak id

duplicates cage_peaks_xx.fantom5_id?

Associated gene name Associated gene symbol Associated gene synonym Source of associated gene

Rattus norvegicus

Ensembl, gene id

Ensembl, transcript id

duplicates cage_promoters2transcripts?

Ensembl_gene_name Augustus_gene_name RefSeq_transcript_id Geneid_gene_name Genscan_gene_name EST_gene_name

?

Database schema

Results of analyses of CAGE-seq data presented as a set of SQL tables.

These tables are populated for each species, where xx is suffix that specifies specie and genome build.

Table suffix

Species

Genome build

hg38

Homo sapiens, human

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26

mm10

Mus musculus, mouse

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.20

rn6

Rattus norvigicus, rat

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001895.5

ER diagram

   hide circle    skinparam linetype ortho     entity "cage_peaks_xx" as peaks {      *id : int(10)      --      chrom : varchar(10)      start : int(10)      end : int(10) unsigned       strand : char(1)      abs_summit : int(10)      name : mediumtext      type : varchar(255)      organism : varchar(50)      gene_symbol : mediumtext      enhancer_id : int(10)      fantom5_id : mediumtext      version : int(10)      source : varchar(255)    }     entity "cage_peaks_expression_xx" as expression {      peak_id: int(10)      experiment_id: int(10)      value : float    }     entity "cage_promoters_annotation_xx" as annotation {      peak_id: int(10)      external_db: varchar(30)      external_db_id: mediumtext    }     entity "cage_promoters2transcripts_xx" as transcripts {      peak_id: int(10)      ensembl_transcript_id: varchar(30)      distance: int(11)    }     entity "cage_enhancers_xx" as enhancers {      *id : int(10)      --      chrom : varchar(10)      start : int(10)      end : int(10) unsigned       type : varchar(255)      name : mediumtext    }     entity "cage_enhancers2genes_xx" as enhancers2genes {      enhancer_id : int(10)      ensembl_gene_id : varchar(20)      type : enum('corr','prox')      distance : int(11)      corr : float      padj : float    }     peaks }o..|| annotation    peaks }o..|| transcripts    peaks }o..|| enhancers    peaks }o..|| expression    enhancers }o..|| enhancers2genes

Database schema for CAGE-seq data

Table cage_peaks_xx

These tables describes CAGE peaks (TSSs - Transcription Start Sites).

Field

Type

Index

Description

id

int(10)

primary

internal CAGE peak (TSS) identifier

chrom

varchar(10)

chromosome according to coresponding genome build (xx) Example: chr1.

start

int(10)

start of CAGE peak region

end

int(10)

end of CAGE peak region

strand

char(1)

TSS strand (+ or -)

abs_summit

int(10)

start of the representative TSS position

name

mediumtext -> varchar

?

name (ID) of the CAGE peak Example: hg19::chr1:564571..564600,+;hg_1.1

type

varchar(255)

?

type of CAGE peak: promoter or enhancer

organism

varchar(50)

organism. ? remove

gene_symbol

mediumtext -> varchar

gene symbol for CAGE peaks that are promoters

enhancer_id

int(10)

id of enhancer that includes this CAGE peak

fantom5_id

mediumtext -> varchar

CAGE peak id in FANTOM 5

version

int(10)

CAGE peak version

source

varchar(255)

?

Source (database or project) from which this CAGE peak was imported. Possible values: FANTOM 5

Table cage_promoters_annotation_xx

Some CAGE peaks are annotated as promoters. Corresponding links are stored in cage_promoters_annotation_xx tables.

Field

Type

Index

Description

peak_id

int(10)

internal CAGE peak (TSS) identifier

external_db

varchar(30)

External database name

external_db_id

mediumtext -> varchar

Corresponding transript/gene/protein identifier in the external database

Table cage_promoters2transcripts_xx

Some CAGE peaks are annotated as promoters. Tables cage_promoters2transcripts contain distance and id for nearest transcripts (distance <500 bp).

Field

Type

Index

Description

peak_id

int(10)

mul

internal CAGE peak (TSS) identifier

ensembl_transcript_id

varchar(30)

distance

int(11)

Table cage_enhancers_xx

Some CAGE peaks are joined into enhancers.

Field

Type

Index

Description

id

int(10)

primary

internal enhancer identifier

chrom

varchar(10)

chromosome according to coresponding genome build (xx) Example: chr1.

start

int(10)

start of enhancer region

end

int(10)

end of enhancer region

type

varchar(255)

Source (database or project) from which this enhancer was imported. Possible values: FANTOM 5 rename to sourse

name

mediumtext -> varchar

name (ID) of the CAGE peak Example: chr10:100006233-100006603

Table cage_enhancers2genes_xx

These tables describe what genes are potentially regulated by enhancers.

Field

Type

Index

Description

enhancer_id

int(10)

internal enhancer identifier

ensembl_gene_id

varchar(20)

type

enum(‘corr’,’prox’)

type of possible promoter-enhancer interaction

distance

int(11)

distance between promoter-enhancer, for type ‘prox’

corr

float

expression correlation between promoter-enhancer, for type ‘corr’

padj

float

adjusted P value for expression correlation between promoter-enhancer, for type ‘corr’