
Data Curation in Bgee

Bgee is a database for retrieval and comparison of gene expression patterns across multiple animal species. Bgee is based exclusively on curated healthy wild-type expression data to provide a comparable reference of gene expression. The data included in Bgee undergo manual curation to ensure that all captured metadata are accurate and reliable. For this purpose, ontologies are used to annotate anatomical entities, developmental stages, and cell types, which supports simple comparisons between species and the propagation of information.

Introduction

How are experiments identified for annotation?

We identify experiments for annotation in several ways. A Python script and an R script mine the SRA for experiments to annotate. We also accept external and internal requests for annotation, and take on large annotation projects such as GTEx and Fly Cell Atlas.

What is "healthy wild-type"?

Before annotation, we check that the samples are considered healthy wild-type by the Bgee definition: in particular, no knock-outs, no mutations, no cancer or other diseases, and no treatment. This step ensures that all data in the Bgee database represent healthy wild-type expression.

Some common scenarios we exclude: cell lines, cell cultures, intestinal germ-free animals, injured animals, experiments where only polysomal RNA was hybridized, and animals killed by exsanguination under CO2 anesthesia.

Some common scenarios we include: a BMI greater than 18.5 and less than 35; fasted animals (with a reasonable fasting time); dark/light circadian rhythms and temperature variation when reasonable for the animal's physiology; a short-duration low- or high-fat diet; mammary glands from virgin, pregnant, and lactating females; oocytes at different stages of maturation; placenta and extraembryonic components during development; animals selected for their behavior; animals from different strains/ethnicities; human post-mortem tissues; a light impulse used to stress the animals; animals killed by cervical dislocation or decapitation; and animals killed by inhalants (CO2).
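
As a rough illustration, a few of the criteria above can be written as a simple predicate. This is a minimal sketch, not Bgee's actual screening procedure: the `Sample` record, its field names, and `is_healthy_wild_type` are hypothetical, and only a small subset of the listed scenarios is encoded.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record; the field names are illustrative, not Bgee's schema.
@dataclass
class Sample:
    is_cell_line: bool = False
    has_mutation_or_knockout: bool = False
    has_disease: bool = False
    is_treated: bool = False
    bmi: Optional[float] = None  # only reported for some human samples


def is_healthy_wild_type(sample: Sample) -> bool:
    """Return True if the sample passes the subset of criteria encoded here."""
    if (sample.is_cell_line or sample.has_mutation_or_knockout
            or sample.has_disease or sample.is_treated):
        return False
    # When a BMI is reported, it must be greater than 18.5 and less than 35.
    if sample.bmi is not None and not (18.5 < sample.bmi < 35):
        return False
    return True
```

In practice many of the scenarios above (for example "reasonable fasting time") require a curator's judgment and cannot be reduced to a flag.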

What data are captured?

Various data are captured and manually checked, and several data fields must be captured using controlled vocabularies or ontologies. The following data are always captured:

Sample metadata:

  • anatomical entity and/or cell type, developmental stage, sex, strain/ethnicity, species

Experiment metadata:

  • scRNA-seq and bulk RNA-seq: library ID, experiment ID, sequencing technology, sequencing platform, sequenced transcript part (3', 5' or full length)
  • Affymetrix: experiment ID, chip ID
  • in situ hybridization: experiment ID, evidence ID
  • EST: library ID
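
The required fields listed above could be checked with a small lookup, sketched here. The field names, the data type keys, and the `missing_fields` helper are assumptions made for illustration; only the grouping of fields per data type comes from the lists above.

```python
# Sample-level fields that are always captured (names are illustrative).
SAMPLE_FIELDS = {"developmental_stage", "sex", "strain_or_ethnicity", "species"}

# Experiment-level fields per data type, following the list above.
EXPERIMENT_FIELDS = {
    "rna_seq": {"library_id", "experiment_id", "sequencing_technology",
                "sequencing_platform", "sequenced_transcript_part"},
    "affymetrix": {"experiment_id", "chip_id"},
    "in_situ": {"experiment_id", "evidence_id"},
    "est": {"library_id"},
}


def missing_fields(record: dict, data_type: str) -> set:
    """Return the names of required fields that are absent or empty."""
    required = SAMPLE_FIELDS | EXPERIMENT_FIELDS[data_type]
    missing = {name for name in required if not record.get(name)}
    # An anatomical entity and/or a cell type must be present.
    if not (record.get("anatomical_entity") or record.get("cell_type")):
        missing.add("anatomical_entity_or_cell_type")
    return missing
```

For example, an Affymetrix record that lacks a chip ID would be reported as missing `chip_id`.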

Annotation Process

Annotation guidelines can be found in the Bgee expression-annotations-documents repository.

Ontologies Used

The following ontologies are used, each for the use case listed.

Ontology | OBO Prefix | Use Case | Last Import
Uberon composite metazoan anatomy ontology | UBERON, others | anatomical structure annotation | 2020-09-16
NCBI organismal classification | NCBITaxon | species annotation | 2020-12-14
Composite developmental and life stage ontology | UBERON, others | developmental stage annotation | 2021-03-01
Cell Ontology | CL | cell type annotation | 2020-09-16

How are data annotated with ontologies?

Bgee manually curates expression data using ontologies and controlled vocabularies to ensure accurate annotation, allow data propagation, and support species-to-species comparisons. The Bgee curators prioritize using the most accurate term to annotate data. We also internally keep track of the quality of each annotation to support re-annotation and annotation quality control. Generally, when no information is available for a given annotation, the root of the relevant ontology is used.

Anatomical Entity

Uberon, the multi-species anatomy ontology, is used to annotate anatomical entities. When the experiment is done using cells, it can be appropriate to annotate using the Cell Ontology. Additionally, there are cases where the anatomical entity is a post-composition of an anatomical structure and a cell type. A few special cases to look out for are listed below.

Term | Uberon ID | Uberon Name
missing organ, embryo stage | UBERON:0000922 | embryo
missing organ, adult stage | UBERON:0007023 | adult organism
missing organ, unknown stage | UBERON:0000465 | material anatomical entity
muscle | UBERON:0002385 | muscle tissue
skeletal muscle | UBERON:0001134 | skeletal muscle tissue
human breast | UBERON:0005200 | thoracic mammary gland
skin | UBERON:0000014 | zone of skin
whole organism adult or whole body adult | UBERON:0007023 | adult organism
Broca's area | UBERON:0001870 | frontal cortex
hippocampus | UBERON:0001954 | Ammon's horn
unfertilized egg | CL:0000025 | egg cell
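
The special cases in this table lend themselves to a plain lookup, sketched below. The IDs and names are copied from the table; the function names and the convention of passing the stage as "embryo", "adult", or `None` are assumptions made for this example.

```python
# Special-case terms from the table above, keyed by lower-cased source term.
SPECIAL_CASES = {
    "muscle": ("UBERON:0002385", "muscle tissue"),
    "skeletal muscle": ("UBERON:0001134", "skeletal muscle tissue"),
    "human breast": ("UBERON:0005200", "thoracic mammary gland"),
    "skin": ("UBERON:0000014", "zone of skin"),
    "whole organism adult": ("UBERON:0007023", "adult organism"),
    "whole body adult": ("UBERON:0007023", "adult organism"),
    "broca's area": ("UBERON:0001870", "frontal cortex"),
    "hippocampus": ("UBERON:0001954", "Ammon's horn"),
    "unfertilized egg": ("CL:0000025", "egg cell"),
}

# Terms used when no organ is reported, keyed by the known stage (or None).
MISSING_ORGAN = {
    "embryo": ("UBERON:0000922", "embryo"),
    "adult": ("UBERON:0007023", "adult organism"),
    None: ("UBERON:0000465", "material anatomical entity"),
}


def lookup_special_case(term):
    """Return (ID, name) for a special-case term, or None if it is not one."""
    return SPECIAL_CASES.get(term.strip().lower())


def missing_organ_term(stage=None):
    """Return the (ID, name) to use when the organ is not reported."""
    return MISSING_ORGAN.get(stage, MISSING_ORGAN[None])
```

A term that is not in the table returns `None`, which signals that ordinary curation applies.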

Developmental Stage

We use species-specific developmental stage ontologies when they are available. When a species does not have a dedicated developmental stage ontology, we either report the developmental stages directly using the metastages of Uberon, or create a species-specific developmental stage ontology for it. For species-to-species comparisons, all developmental stage ontologies are mapped back to the metastages of Uberon.

The following species-specific developmental stage ontologies are used in Bgee:

Species | Ontology
human | hsapdv.obo
mouse | mmusdv.obo
fly | fbdv.obo
zebrafish | zfs.obo
worm (C. elegans) | wbls.obo
frog | xao.obo
rat | rnordv.obo
cow | btaudv.obo
lizard | acardv.obo
chicken | ggaldv.obo
gorilla | ggordv.obo
opossum | mdomdv.obo
rhesus macaque | mmuldv.obo
platypus | oanadv.obo
bonobo | ppandv.obo
orangutan | ppygdv.obo
chimpanzee | ptrodv.obo
pig | sscrdv.obo
medaka | olatdv.obo
platynereis | pdumdv.obo
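
The choice of stage ontology described in this section can be sketched as a lookup with a fallback. The file names come from the table; the common-name keys, the `UBERON_METASTAGES` placeholder label, and the function name are assumptions made for this example.

```python
# Species-specific developmental stage ontologies, from the table above.
STAGE_ONTOLOGY_FILES = {
    "human": "hsapdv.obo", "mouse": "mmusdv.obo", "fly": "fbdv.obo",
    "zebrafish": "zfs.obo", "worm": "wbls.obo", "frog": "xao.obo",
    "rat": "rnordv.obo", "cow": "btaudv.obo", "lizard": "acardv.obo",
    "chicken": "ggaldv.obo", "gorilla": "ggordv.obo",
    "opossum": "mdomdv.obo", "rhesus macaque": "mmuldv.obo",
    "platypus": "oanadv.obo", "bonobo": "ppandv.obo",
    "orangutan": "ppygdv.obo", "chimpanzee": "ptrodv.obo",
    "pig": "sscrdv.obo", "medaka": "olatdv.obo",
    "platynereis": "pdumdv.obo",
}

# Placeholder label (not a real file name) for the Uberon metastages fallback.
UBERON_METASTAGES = "uberon-metastages"


def stage_ontology_for(species: str) -> str:
    """Return the stage ontology for a species, or the metastage fallback."""
    return STAGE_ONTOLOGY_FILES.get(species.strip().lower(), UBERON_METASTAGES)
```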

Cell Type

The Cell Ontology is used to annotate cell types. This is only applicable to single-cell RNA-seq experiments. Bgee does not perform its own clustering and cell type annotation; instead, the cell type annotations provided by the authors are mapped back to the most appropriate Cell Ontology term.

Strain/Ethnicity

We use the UniProt controlled vocabulary to report strains.

We report 'subspecies' (as defined by NCBI Taxonomy) as a strain with the parent species annotated as the 'speciesId'.

For humans, this field is where we annotate ethnicity. The accepted values (as suggested by the FDA) are American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, and White.

Sex

The sex field records the sex of the animal the sample was taken from. The possible values are male, female, hermaphrodite, NA (not available or unknown), or mixed (mixture of different sexes in the sample).
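
Because sex and human ethnicity take values from closed lists, they are easy to validate automatically. The sketch below encodes the values given in this section and the previous one; the constant and function names are hypothetical, and exact string matching is an assumption.

```python
# Accepted values for the sex field, as listed above.
SEX_VALUES = {"male", "female", "hermaphrodite", "NA", "mixed"}

# Accepted human ethnicity values, as listed in the Strain/Ethnicity section.
HUMAN_ETHNICITY_VALUES = {
    "American Indian or Alaska Native",
    "Asian",
    "Black or African American",
    "Native Hawaiian or Other Pacific Islander",
    "White",
}


def is_valid_sex(value: str) -> bool:
    """Check a sex annotation against the accepted values."""
    return value in SEX_VALUES


def is_valid_human_ethnicity(value: str) -> bool:
    """Check a human ethnicity annotation against the accepted values."""
    return value in HUMAN_ETHNICITY_VALUES
```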

Technologies and Protocols

The accepted technologies and protocols for bulk RNA-seq and scRNA-seq are listed below. We try to capture information about the kit used for preparing libraries to determine which RNA population could be captured. More information about scRNA-seq protocols can be found in our comparative guide.

bulk RNA-seq | scRNA-seq
Illumina Genome Analyzer | Smart-seq
Illumina Genome Analyzer II | Smart-seq2
Illumina Genome Analyzer IIx | 10X Chromium V2
Illumina HiScanSQ | 10X Chromium V3
Illumina HiSeq 1000 |
Illumina HiSeq 1500 |
Illumina HiSeq 2000 |
Illumina HiSeq 2500 |
Illumina HiSeq 3000 |
Illumina HiSeq 3500 |
Illumina HiSeq 4000 |
Illumina HiSeq X Ten |
Illumina MiSeq |
Illumina NextSeq 500 |
Illumina NextSeq 550 |
Illumina NovaSeq 6000 |
Ion Torrent Proton |

Multi-Species Integration

To make our annotations comparable between species, we remap all source information to the composite-metazoa version of Uberon. This means that we carefully check and use cross-references in Uberon and other ontologies to map, for instance, data from ZFIN or FlyBase to this common representation.
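
The cross-reference step can be illustrated with a small sketch. It assumes the cross-references have already been extracted into a dictionary from Uberon ID to a list of source ontology IDs; that input shape, the function names, and the rule of returning `None` for absent or ambiguous mappings are assumptions for this example, not a description of Bgee's pipeline.

```python
from collections import defaultdict


def build_xref_index(uberon_xrefs):
    """Invert {uberon_id: [source_id, ...]} into {source_id: {uberon_id, ...}}."""
    index = defaultdict(set)
    for uberon_id, source_ids in uberon_xrefs.items():
        for source_id in source_ids:
            index[source_id].add(uberon_id)
    return index


def remap_to_uberon(source_id, index):
    """Return the single Uberon ID for source_id, or None if absent or ambiguous."""
    candidates = index.get(source_id, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # no unique mapping: leave for manual checking


# Usage with placeholder IDs (not real ontology terms).
index = build_xref_index({
    "UBERON:EXAMPLE1": ["ZFA:EXAMPLE1", "FBbt:EXAMPLE1"],
    "UBERON:EXAMPLE2": ["ZFA:EXAMPLE2"],
    "UBERON:EXAMPLE3": ["ZFA:EXAMPLE2"],
})
mapped = remap_to_uberon("ZFA:EXAMPLE1", index)
```

Returning `None` rather than picking one candidate reflects the careful checking described above: a source term with several cross-references needs a curator's decision.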

We also use the taxon constraints (for instance this Protégé plugin) and lineage-specific General Class Inclusion (GCI) relations to produce species-specific anatomical representations from this common ontology, which integrates almost all other animal anatomy ontologies.

Finally, we define relations of historical homology between anatomical entities to determine which entities can be compared between species. See our Anatomical homology search and our repository of similarity annotations between anatomical structures for more information.
