Glossary

Glossary for Gene Expression Profiling and Genotyping Terms

Allele

One of two or more alternative forms of a gene or marker sequence. An allele differs from the other alleles at one or more mutational sites, as judged by phenotype or sequence. Polymorphisms are included in this definition.

ANOVA

Analysis of variance. A statistical test for determining differences in mean values between two or more groups.

α-value

The nominal probability (set by the investigator) of making a type 1 error.

Bayesian probability

The probability of a proposition being true, which is conditional on the observed data.

Bonferroni correction

A family-wise error rate (FWER) control procedure that sets the α-value for each individual test to the overall α-value divided by the number of tests. It strongly controls the FWER for any dependency structure among the tests.
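
As a minimal sketch, assume a family of m tests and an overall α-value chosen by the investigator, so that each test is compared against α/m. The function name and the example p-values are illustrative only.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # per-test significance level
    return [p <= threshold for p in p_values]


# Made-up p-values for four tests; the per-test threshold is 0.05 / 4 = 0.0125.
example_p = [0.001, 0.012, 0.03, 0.2]
decisions = bonferroni_reject(example_p, alpha=0.05)
print(decisions)  # [True, True, False, False]
```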

Box plots

Box plots present various statistics for a given data set. The plots consist of boxes with a central line and two tails. The central line represents the median of the data, whereas the edges of the box represent the upper (seventy-fifth percentile) and lower (twenty-fifth percentile) quartiles. The tails extend from the box toward the extremes of the data. Such plots are often used in describing the range of log ratios that is associated with replicate spots.
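
As a rough illustration of the statistics a box plot summarizes, the median and quartiles of a small data set can be computed with the Python standard library. The function name and the data are hypothetical, and the "inclusive" quantile method is just one of several conventions in use.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return (lower quartile, median, upper quartile) for a list of numbers."""
    lower, median, upper = quantiles(values, n=4, method="inclusive")
    return lower, median, upper


# Hypothetical log ratios from replicate spots.
log_ratios = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = box_plot_summary(log_ratios)
print(summary)  # (3.0, 5.0, 7.0)
```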

Case

In a microarray experiment, a case is the biological unit under study; for example, one soybean, one mouse or one human.

Detection Call

A qualitative value that suggests a level of confidence in the signal calculated for that probe. For some platforms, the detection call reflects the quality of the nucleic acid spot on the microarray, similar to Flag/No Flag scores. On other platforms the detection call reflects the abundance of the target transcript or the concordance of results between multiple probes in a probe set, similar to Absent/Present calls. While the final detection call is qualitative, it is usually based on quantitative assessments and complex statistics.

Experiment

The complete set of microarray hybridizations performed for a common purpose. Here we take experiment to mean an observational or perturbing study. An experiment will often be equivalent to a project.

External RNA Control

An RNA species added to a biological sample during processing for the purpose of assessing technical performance of a gene expression assay. Different external RNA controls may be used to monitor different processes. In microarray research, external RNA controls may be added to total RNA to assess the initial enzymatic processes or to the labeled cRNA to assess hybridization.

False discovery rate (FDR)

The expected proportion of rejected null hypotheses that are false positives. When no null hypotheses are rejected, FDR is taken to be zero.
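
One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. This glossary does not prescribe a particular method, so the sketch below is an illustration under that assumption. The function name, the target level q and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


example_p = [0.001, 0.012, 0.03, 0.2]  # made-up p-values
bh_decisions = benjamini_hochberg(example_p, q=0.05)
print(bh_decisions)  # [True, True, True, False]
```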

Fold change

A metric for comparing a gene's mRNA-expression level between two distinct experimental conditions. Its arithmetic definition differs between investigators.
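
Because the arithmetic definition varies, it helps to state which one is in use. The sketch below shows three common conventions: a plain ratio, a log2 ratio, and a signed fold change in which ratios below 1 are reported as negative reciprocals. The function names and expression values are illustrative assumptions, not a standard API.

```python
import math


def ratio_fold_change(treated, control):
    """Plain ratio of the two expression levels."""
    return treated / control


def log2_fold_change(treated, control):
    """Log2 of the ratio; up- and down-regulation become symmetric around 0."""
    return math.log2(treated / control)


def signed_fold_change(treated, control):
    """Ratio if >= 1, otherwise the negative reciprocal of the ratio."""
    ratio = treated / control
    return ratio if ratio >= 1 else -1 / ratio


print(ratio_fold_change(200, 100))   # 2.0
print(log2_fold_change(50, 100))     # -1.0
print(signed_fold_change(50, 100))   # -2.0
```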

Gene

A general term used to represent both a DNA segment and the collection of RNA transcripts derived from it. In the DNA usage, a gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions. In the RNA usage, a gene often refers to the targets measured in a gene expression assay.

Gene Ontology

A way of describing gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner, using a controlled vocabulary.

Genotype

The total sum of the genetic information of an organism that is known and relevant to the experiment being performed, including chromosomal, plasmid, viral or other genetic material.

Heatmap

Heatmaps consist of small cells, each consisting of a colour, which represent relative expression values. Heatmaps are often generated from hierarchical cluster analyses of both samples and genes. Often the rows represent genes of similar expression values, whereas the columns indicate different biological samples. Heatmaps offer a quick overview of clusters of genes that show similar expression values.

Long-range error rate

The expected error rate if experiments and analyses of the type under consideration were repeated an infinite number of times.

non-synonymous SNP

SNPs may fall within coding sequences of genes, noncoding regions of genes, or in the intergenic regions between genes. SNPs within a coding sequence will not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code. A SNP in which both forms lead to the same polypeptide sequence is termed synonymous (sometimes called a silent mutation); if a different polypeptide sequence is produced, the SNP is non-synonymous.

Normalization

The process by which microarray spot intensities are adjusted to take into account the variability across different experiments and platforms.

Null hypothesis

The hypothesis that is being tested in a statistical test. Typically in a microarray setting it is the hypothesis that states: there is no difference between gene-expression levels across groups or conditions.

Overfitting

This occurs when an excessively complex model with too many parameters is developed from a small sample of 'training' data. The model fits those data well, but does so by capitalizing on chance variations and, therefore, will fit a fresh set of 'test' data poorly.

Permutation test

A statistical hypothesis test in which some elements of the data are permuted (shuffled) to create multiple new pseudo-data sets. One then evaluates whether a statistic quantifying departure from the null hypothesis is greater in the observed data than a large proportion of the corresponding statistics calculated on the multiple pseudo-data sets.
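
The idea can be sketched for a two-group comparison, assuming the statistic of interest is the absolute difference in group means and that group labels are exchangeable under the null hypothesis. The function name, the default number of permutations and the seed are illustrative choices.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # create one pseudo-data set
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            count += 1
    # Adding one to numerator and denominator keeps the estimate above zero.
    return (count + 1) / (n_permutations + 1)


# Hypothetical expression values for one gene in two conditions.
p = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
print(p)
```

With a random subset of permutations the result is an estimate; with few samples, all possible relabelings can be enumerated instead.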

Phenotype

The detectable outward manifestations of a specific genotype, including physiological parameters.

Plasmode

A real (not computer simulated) data set for which the true structure is known and is used as a way of testing a proposed analytical method. A rare thing!

Power

Probability of rejecting a null hypothesis that is false. However, power has been defined in several ways for microarray studies.

Probe

A defined piece of nucleic acid which is used to identify specific DNA or RNA molecules bearing the complementary sequence. Some microarray platforms rely on a single oligonucleotide probe to assay an RNA target. Other microarray platforms combine data from multiple probes, arranged in a probe set, when calculating expression values for a target.
Bead-based assays attach oligonucleotide probes to a microscopic bead surface.
PCR-based assays use a pair of oligonucleotide primers to identify and amplify their intended RNA target, and in some cases, an oligonucleotide detection probe hybridizes to the amplified target. This definition relies on the standardized nomenclature suggested by MGED/MIAME (Brazma et al., 2001).

p-value

The probability, were the null hypothesis true, of obtaining results that are as discrepant or more discrepant from those expected under the null hypothesis than those actually obtained.

p-value histograms

p-value histograms have abscissae that range from 0 to 1 and contain the p-values for a test of differential expression for each gene. They are common supplements to the formal mixture models that enable the popular calculation of false-discovery rates.

QTL

Quantitative inheritance, or polygenic inheritance, refers to inheritance of a phenotypic characteristic (trait) that is attributable to two or more genes and their interaction with the environment. Unlike monogenic (qualitative) traits, polygenic traits do not follow patterns of Mendelian inheritance. Instead, their phenotypes typically vary along a continuous gradient.
A quantitative trait locus (QTL) is a region of DNA that is associated with a particular phenotypic trait; the QTLs for a trait are often found on different chromosomes. Knowing the number of QTLs that explain variation in the phenotypic trait provides information about the genetic architecture of the trait. For example, it may indicate whether human height is controlled by many genes of small effect or by a few genes of large effect.

Repeatability

The ability to provide closely similar results from replicate samples processed in parallel on the same instrument using the same microarray type.

Replicate

Two kinds of replicate are distinguished: biological replicates, which consist of independent biological samples made from different individuals or cell cultures, and
technical replicates, in which the same sample is used on identical microarrays to assess technical variation within an experiment.

Reproducibility

The ability to provide closely similar results from replicate samples processed by different persons on different instruments (same manufacturer) using the same microarray type.

RMA

Robust Multiarray Average. A quantitative measure of the relative abundance of a transcript. RMA is a summary measure of probes on arrays (Affymetrix: perfect match (PM) features only). The values are background-adjusted, normalized and log-transformed.

Sampling variation

The variability in statistics that occurs among random samples from the same population and is due solely to the process of random sampling.

Signal

The quantitative expression value for each probe derived from a hybridization image after preprocessing steps, such as background subtraction and summarizing data from multiple probes, as well as normalization procedures that remove systematic artifacts. Signals are not the raw fluorescent or chemiluminescent intensities captured in a pixelated microarray image.

SNP

A Single Nucleotide Polymorphism is a DNA sequence variation occurring when a single nucleotide (A, C, G, or T) in the genome differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, TTGGAGCT and TTAGAGCT, contain a difference in a single nucleotide. In this case there are two alleles: G and A. Almost all common SNPs have only two alleles.
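
The example fragments can be compared position by position. This is a minimal sketch that assumes the two sequences are already aligned and of equal length; the function name is hypothetical.

```python
def single_nucleotide_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for each mismatch, using 1-based positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


differences = single_nucleotide_differences("TTGGAGCT", "TTAGAGCT")
print(differences)  # [(3, 'G', 'A')]
```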

synonymous SNP

SNPs may fall within coding sequences of genes, noncoding regions of genes, or in the intergenic regions between genes. SNPs within a coding sequence will not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code. A SNP in which both forms lead to the same polypeptide sequence is termed synonymous (sometimes called a silent mutation); if a different polypeptide sequence is produced, the SNP is non-synonymous.

Transformation

The application of a specific mathematical function so that data are changed into a different form. Often, the new form of the data satisfies assumptions of statistical tests. The most common transformation in microarray studies is log2.

t-tests

Statistical tests that are used to determine whether two groups differ significantly by comparing their two independent means.
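
As one illustration, the pooled-variance (Student's) t statistic for two independent groups can be computed as below. This is a sketch under the assumption of equal variances in the two groups; other variants, such as Welch's test, relax that assumption. The function name and data are hypothetical.

```python
import math
from statistics import mean, variance


def student_t_statistic(group_a, group_b):
    """Two-sample t statistic using a pooled estimate of the variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical log2 expression values for one gene in two groups.
t_stat = student_t_statistic([1, 2, 3], [2, 3, 4])
print(round(t_stat, 4))  # -1.2247
```

The statistic is then compared with a t distribution on n_a + n_b - 2 degrees of freedom to obtain a p-value.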

Type 1 error

A false positive, or the rejection of a true null hypothesis; for example, declaring a gene to be differentially expressed when it is not.

Type 2 error

A false negative, or failing to reject a false null hypothesis; for example, not declaring a gene to be differentially expressed when it is.

Volcano plots

Volcano plots are used to look at fold change and statistical significance simultaneously. These Cartesian plots typically show -log10(p-values) or log odds on the ordinate and fold-change values on the abscissa for all genes in a data set. The name stems from the volcano shape of the plots. The upper corners of the plot represent genes that show both statistical significance and large fold changes.
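
The coordinates for such a plot can be sketched as follows, assuming fold changes are given as ratios and are placed on a log2 scale on the abscissa, which is a common but not universal choice. The function name and the input values are illustrative.

```python
import math


def volcano_coordinates(fold_changes, p_values):
    """Return (log2 fold change, -log10 p-value) pairs, one per gene."""
    return [
        (math.log2(fc), -math.log10(p))
        for fc, p in zip(fold_changes, p_values)
    ]


# Hypothetical ratios and p-values for three genes.
points = volcano_coordinates([4.0, 0.25, 1.0], [0.01, 0.001, 0.5])
for x, y in points:
    print(f"{x:+.2f}\t{y:.2f}")
```

Genes with large absolute x and large y fall in the upper corners of the plot.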



Many definitions given above are based on the recommendations of the Functional Genomics Data Society (FGED, including MIAME and MAGE), the Clinical and Laboratory Standards Institute (CLSI), the MicroArray Quality Control (MAQC) project of the FDA, and Allison et al., 2006.