David Steffen, Ph.D.
Proteins are variable-length[1], linear, mixed polymers of 20 different amino acids[2]. Other terms used more or less interchangeably for amino acid polymers are peptides and polypeptides[3]. These topologically linear polymers fold upon themselves to generate a shape characteristic of each different protein, and this shape, along with the different chemical properties of the 20 amino acids, determines the function of the protein. One of the most important concepts in modern biology is that the functional properties of proteins are determined largely by the sequence of the 20 amino acids in the linear polypeptide chain; that is, in many cases proteins are largely self-folding. Thus, in theory, knowing the sequence of a protein (the order in which the amino acids occur), one could infer its function.
What determines the order of amino acids in a protein? The Central Dogma of Molecular Biology describes how the genetic information we inherit from our parents is stored in DNA, and how that information is used to make identical copies of that DNA and is also transferred from DNA to RNA to protein. DNA is a linear polymer of 4 nucleotides[4]: deoxyAdenosine monophosphate (abbreviated A), deoxyThymidine monophosphate (abbreviated T), deoxyGuanosine monophosphate (abbreviated G) and deoxyCytidine monophosphate (abbreviated C). RNA is a very similar polymer of Adenosine monophosphate, Guanosine monophosphate, Cytidine monophosphate, and Uridine monophosphate. Uridine monophosphate, abbreviated U, is a nucleotide functionally equivalent to Thymidine monophosphate.
A property of both DNA and RNA is that the linear polymers can pair one with another, such pairing being sequence specific. In such double polymers (referred to as a "double helix" due to the shape they assume) G pairs with C and A pairs with T or U. All possible combinations of DNA and RNA double helices occur. One strand of DNA can serve as a template for the construction of a complementary strand, and this complementary strand can be used to recreate the original strand. This is the basis of DNA replication and thus all of genetics. Similar templating results in an RNA copy of a DNA sequence. Conversion of that RNA sequence into a protein sequence is more complex. This occurs by translation of a code consisting of three nucleotides into one amino acid, a process accomplished by cellular machinery including tRNA and ribosomes.
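The pairing rules above can be sketched in a few lines of Python. This is only an illustration, not part of the original course material; the function names are hypothetical. One detail the text does not spell out is that the two strands of a double helix run in opposite directions, so the partner strand is conventionally written reversed (the "reverse complement").

```python
# Pairing rules for DNA: G pairs with C, A pairs with T.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna):
    """Return the base that pairs with each position of the given strand."""
    return "".join(DNA_COMPLEMENT[base] for base in dna.upper())


def reverse_complement(dna):
    """Return the partner strand, written in its own reading direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Return an RNA copy with the same sequence as the DNA strand (T becomes U)."""
    return dna.upper().replace("T", "U")


original = "ATGGCC"
partner = reverse_complement(original)
print(partner)                      # GGCCAT
print(reverse_complement(partner))  # ATGGCC, the original strand is recovered
print(transcribe(original))         # AUGGCC
```

Applying the templating step twice returns the starting sequence, which is the property that makes replication possible.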
Four different nucleotides taken three at a time can result in 64 different possible triplet codes; more than enough to encode 20 amino acids. The way that these 64 codes are mapped onto 20 amino acids is first, that one amino acid may be encoded by 1 to 6 different triplet codes, and second, that 3 of the 64 codes, called stop codons, specify "end of peptide sequence". Where multiple codons specify the same amino acid, the different codons are used with unequal frequency and this distribution of frequency is referred to as "codon usage". Codon usage varies between species.
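As a rough illustration of the mapping described above, the following Python sketch builds the standard genetic code, counts how many codons encode each amino acid, translates a short sequence, and tallies codon usage. The names `CODON_TABLE`, `translate` and `codon_usage` are hypothetical, and the example sequences are invented. Codons are written in the DNA alphabet (T rather than U), amino acids use the one-letter abbreviations, and `*` marks a stop codon.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G and the
# third codon position varying fastest. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def codon_usage(dna):
    """Count how often each codon occurs, reading from position 0."""
    return Counter(dna[i:i + 3].upper() for i in range(0, len(dna) - 2, 3))


print(len(CODON_TABLE))  # 64
stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(stops)  # ['TAA', 'TAG', 'TGA']

# Number of codons per amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
print(len(degeneracy), min(degeneracy.values()), max(degeneracy.values()))  # 20 1 6

print(translate("ATGGCCTGGTAA"))  # MAW
print(codon_usage("CTGCTGTTA"))   # CTG twice, TTA once; both encode leucine (L)
```

Real codon usage tables are computed over many genes from one species; the last line only shows the counting step on a toy sequence.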
The fact that DNA nucleotides need to be read three at a time to specify a protein sequence implies that a DNA sequence has three different reading frames determined by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the same frame as nucleotide one and so on). Both strands of DNA can be copied into RNA (for translation into protein). Thus, a DNA sequence with its (inferred) complementary strand can specify six different reading frames.
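The six reading frames can be enumerated with a short Python sketch. The function names and frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for the inferred complementary strand) are assumptions made for this illustration, and the complementary strand is written reversed because the two strands are read in opposite directions.

```python
# Translation table for the pairing rules A-T and G-C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, written in its own reading direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split a strand into complete triplets, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Return the codons of all six reading frames, keyed by frame label."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = codons(strand, offset)
    return frames


for name, triplets in six_frames("ATGGCCTGA").items():
    print(name, " ".join(triplets))
# +1 ATG GCC TGA
# +2 TGG CCT
# +3 GGC CTG
# -1 TCA GGC CAT
# -2 CAG GCC
# -3 AGG CCA
```

Incomplete triplets at the end of a frame are dropped here; that is a choice of this sketch, not a rule from the text.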
It is possible to chemically determine the sequence of amino acids in a protein and of nucleotides in RNA or DNA. However, it is vastly easier at present to determine the sequence of DNA than that of RNA or protein. Since the sequence of a protein can be determined from the DNA sequence which encodes it, most protein sequences are in fact inferred from DNA sequences. Conversion of RNA to a DNA copy (cDNA) is a simple laboratory procedure, so RNA molecules are themselves sequenced as cDNA copies.
Sequence analysis is the process of making biological inferences from the known sequence of monomers in protein, DNA and RNA polymers.
As noted above, the difficulty of sequencing proteins means that most protein sequences are determined from the DNA sequences encoding them. Unfortunately, the cellular pathway from DNA to RNA to protein includes some features that complicate inference of a protein sequence from a DNA sequence.
Once you have obtained a protein sequence, inferring structure and function represents a vastly greater problem. As is noted above, the structure of a protein is produced by the folding of a peptide chain back on itself and, in some cases, the association of multiple peptide chains. This folding is possible because rotation can occur both around bonds within the constituent amino acids and around the bonds that join the amino acids to one another. Unfortunately (or fortunately, as life depends on this fact), the number of possible folding patterns is effectively infinite. To help cope with this daunting problem, biologists have divided the structural features of proteins into levels. The first level of structure, termed primary structure, refers just to the sequence of amino acids in the protein; this is what we know. Decades ago, it was found that polypeptide chains can sometimes fold into regular structures; that is, structures which are the same in shape for different polypeptides. These regular structures are referred to as secondary structure. One such shape is helical, and is referred to as an alpha helix. In another such shape, the polypeptide chain folds back and forth, producing a sheet-like surface. This structure is referred to as a beta sheet. There are additional examples of secondary structural types into which a polypeptide might fold, and some peptides do not fold into one of these regular structures at all. In fact, most long polypeptide chains (e.g. virtually all real biological proteins) fold into different secondary structures along different portions of their length.
The secondary structures described above are all very simple and regular: the round and round of an alpha helix or the back and forth of a beta sheet. There are other structures, found over and over in different proteins, which are more complex than this. One example is the helix-loop-helix motif found in many transcription factors[5]. These features are referred to as super-secondary structure. When you look at an actual polypeptide chain, the final shape is made up of secondary features, perhaps super-secondary structural features, and some apparently random conformations. This overall structure is referred to as the tertiary structure. Finally, many biological proteins are constructed of multiple polypeptide chains. The way these chains fit together is referred to as the quaternary structure of the protein.
The reason that this complex nomenclature for protein structure has developed is that the problem of understanding protein structure is so important and so difficult. The importance of understanding protein structure comes from two factors working together. The first of these is that the function of the protein is absolutely dependent on its structure. In fact, one of the most common ways for proteins to lose their function is to have their structure disrupted, for example by heat or mechanical stress (e.g. beating an egg white); only completely and properly folded proteins "work". The second factor is that it is extremely difficult to determine the structure of a protein experimentally[6]. To date, the primary structure of many proteins has been determined (about 30,000). In contrast, the tertiary structure of many fewer (about 500) has been determined. Obviously, then, it would be of great value if tertiary structure could be determined from primary structure. It is not an exaggeration to state that the ability to exactly predict protein structures and, from that, protein function would revolutionize medicine, pharmacology, chemistry and ecology.
Current research on tertiary structure prediction has used two basic approaches: homology-based and ab initio. Homology-based approaches attempt to determine the tertiary structure of a protein by comparing its primary sequence to that of a related protein whose structure is known. This is a laborious but fairly successful approach. Unfortunately, it requires the existence of similar protein(s) with known structure(s), something not always available. Ab initio approaches try to determine the structure which minimizes free energy. This is done using either Monte-Carlo methods or Neural Net software.
Finally, even if/when you determine the tertiary structure of a protein, techniques have not yet been developed for inferring the functional properties of this protein from its structure.
Databases of protein sequences, including SwissProt and PIR, also exist and can similarly be searched.
Which program should you use to search a database, FASTA or BLAST? This question is about as controversial as that over choices of computers (Mac vs. PC) or religions. In fact, as you enter the world of sequence analysis, you will find religious wars between proponents of different programs over and over. Worse, new programs are constantly appearing. In addition, even after having selected a program, you will frequently have to select values for "parameters" and always have to interpret the output. There are no magic answers to help you do these things. What you will acquire in this course is the background you need to make reasonable decisions on these issues.
One of the most useful things people do with sequences is to compare them to other sequences. However, such comparisons are not as easy to make as one might first think. One factor that complicates analysis is that the sequences biologists need to compare are usually not identical, but only similar. In addition to having a small number of substitutions (e.g. a Guanine for an Adenine at one position in a DNA sequence) there will be insertions and deletions in one sequence relative to the other. Also, depending on what you are comparing and what you want to learn from the comparison, how you do the comparison will be different. For these reasons, there have been many different kinds of programs written to compare sequences.
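To make the idea of substitutions, insertions and deletions concrete, here is a minimal Python sketch of edit distance computed by dynamic programming. It is a generic textbook technique chosen for illustration, not the method used by FASTA, BLAST or any program discussed in this course; the function name, the unit cost for every change, and the example sequences are all assumptions.

```python
def edit_distance(a, b):
    """Smallest number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into nothing takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1       # drop char_a
            insertion = curr[j - 1] + 1  # add char_b
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("GATTACA", "GATTACA"))  # 0, identical sequences
print(edit_distance("GATTACA", "GACTACA"))  # 1, one substitution
print(edit_distance("GATTACA", "GATACA"))   # 1, one deletion
```

Programs written for biological sequences typically weight different changes differently rather than charging one unit for each, which is one reason so many different comparison programs exist.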
VSNS-BCD Copyright.