Statistics is the science of data analysis. The data encountered in this book are derived from genomes. Genomes consist of long chains of DNA, which are represented by sequences in the letters A, C, G and T. These letters abbreviate the four nucleotide bases Adenine, Cytosine, Guanine and Thymine, which serve as fundamental building blocks in molecular biology.
What do statisticians do with their data? They build models of the process that generated the data and, in what is known as statistical inference, draw conclusions about this process. Genome sequences are particularly interesting data to draw conclusions from: they are the blueprint for life, and yet their function, structure, and evolution are poorly understood. Statistical models are fundamental for genomics, a point of view that was emphasized in [Durbin et al., 1998].
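As a purely illustrative sketch of what "building a model and drawing conclusions" can mean, consider the simplest possible assumption: each letter of a sequence is drawn independently from one fixed distribution on A, C, G, T. Under that assumption the maximum likelihood estimate of the distribution is the vector of observed letter frequencies. The function name `estimate_base_frequencies` and the toy sequence below are hypothetical and chosen only for this example; they are not taken from the text or from any particular library.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_base_frequencies(sequence):
    """Estimate letter probabilities under an independence assumption.

    Assumes every position is an independent draw from one distribution
    on {A, C, G, T}; the maximum likelihood estimate is then the
    empirical frequency of each letter. Other characters are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in ALPHABET)
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in ALPHABET}


# Hypothetical toy data, not a real genome fragment.
toy_sequence = "AACGTTTA"
frequencies = estimate_base_frequencies(toy_sequence)
print(frequencies)
```

Real genome sequences are far from independent draws, so this should be read only as the smallest instance of the model-then-infer pattern described above.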
The inference tools we present in this chapter look different from those found in [Durbin et al., 1998], or most other texts on computational biology or mathematical statistics: ours are written in the language of abstract algebra. The algebraic language for statistics clarifies many of the ideas central to the analysis of discrete data, and, within the context of biological sequence analysis, unifies the main ingredients of many widely used algorithms.
Algebraic Statistics is a new field, less than a decade old, whose precise scope is still emerging. The term itself was coined by Giovanni Pistone, Eva Riccomagno and Henry Wynn in the title of their book [Pistone et al., 2000].