GNA-VSNS Biocomputing Course Proposed Unified Master Outline as of 4 January 1995.

Revision History.
4 Jan 1995  Changed a few details of the General Rationale for MathAn Chapter 
            (Chapter 3) as suggested by Andreas Dress.
2 Feb 1995  No more specific changes were requested. Removed some disclaimers.

The following document is the course coordinator's draft for a unified 
master outline.
The URL of this document is:
http://www.techfak.uni-bielefeld.de/techfak/persons/fuellen/syllabus.html

Every author is kindly asked to:

a) Check whether he agrees with the changes to his outline;
b) Check whether all prerequisites he needs are covered by preceding
   chapters, in particular whether David Steffen's chapter trains
   the students in all network resources he needs for his text and homework;
c) Make comments on as many individual chapters as possible, and tell us
   whether he likes the selection of topics, and where he sees a problem;
d) Comment on the general impression, the 'whole picture' he gets.

Also, every other subscriber of the list is kindly asked to say something
regarding his general impression.

The discussion shall be finished by Jan 12, and a large part of it
shall take place on the listserv. However, pls note that some criticicm
may be best directed in private email to the chapter author, and/or myself. 

**********************************************************************
In particular, pls note that due to lack of time, the chapter outlines
are what Robert and myself made out of the submissions, and in general,
the author may NOT have been consulted about changes.
**********************************************************************



Chapter Length and Time Estimates
=================================
=================================

Although we try to draw a realistic picture of the final textbook,
the figures given in the outline are just estimates. No-one will be
scolded for writing a section more or less. Also, every author may
include hypertext links to optional sections providing more details.

1 Section is ca. 30 lines of text (70 characters per line)
10 Sections are ca. 5 pages US letter size / DIN A4, equivalent to
6 hrs study time by an average student (some self-descriptions of
prospective students are available here

6 hrs study of the course material + 4 hrs homework + 2 hrs online 
interaction w/ the instructor shall be allocated per student per week.


Cross-References between Chapters
=================================
=================================

In the following draft outline, the ''->'' sign will mark important
connections between the chapters. In these cases, the respective authors
should get in contact to coordinate their texts. The following acronyms
are used:

Introd  (Ch. 0)   Introduction
PrwAli  (Ch. 1)   Alignments (espec. Pairwise Alignments)
Netwkg  (Ch. 2)   Networking
MathAn  (Ch. 3)   Mathematical Analysis of Sequence Data
MulAli  (Ch. 4)   Multiple Alignment
ProtEn  (Ch. 5)   Protein Folding with Genetic Algorithms: A teaser

I prefer to use these acronyms, because the numbering may be subject to change.


General Prerequisites
=====================
=====================

Math experience, undergraduate level.
Must be close to Completion of Undergraduate Studies in
either CS or biology.
CS students w/o biology experience shall take a look at
Peter Murray-Rusts "Principles of Protein Science"
Biology students w/o programming experience shall take a look into
[to be announced; can anybody provide a good pointer ??].
Neither is mandatory, however.


WEEK 1 
======
======

Introduction. (5 sections, diagrams(???), images(???))
========================
Author: David Steffen, steffen@bcm.tmc.edu

General Rationale: You will be given an overview over the motivations that
biologists have for bothering with current sequence analysis tools,
and you will be given plain-English explanations of the most important of them.
CS students will be given a basic intro to central biological concepts. 
Biology students will be introduced to some new information relevant to the
course, [and so it is hoped that they will not skip the chapter.]

The draft can be found here.
It is mirrored 
here.


WEEK 2 + 3
==========
==========

Chapter 1. Alignments ("PrwAli"). (20 sections)
======================
Author: Robert Giegerich, giegerich@techfak.uni-bielefeld.de
        Dave Wheeler, wheeler@bcm.tmc.edu

General Rationale:
We introduce the most fundamental concepts in sequence analysis: the notions
of distance, similarity and alignment of two sequences. You will learn
about the essential ideas behind the programs which compute these measures.
You will learn that this is a very general and flexible concept, and see some
of its specific variations.

1.1. Definition of alignment, edit-distance, optimal alignment (12 sections)
====

Rationale:
You will study the abstract notion of distance between two sequences and the 
problem of the biological adequacy of this model. The standard method for
calculating distance and alignment will be discussed.

Global alignments
  Definition: Edit distance,  alignment
  Examples
  The Dynamic Programming Scheme
  Variations (linear gap costs)
  Commonly used programs 
[-> Chapter 2, Netwkg (?)]
[-> Chapter 3, MathAn]
[-> Chapter 4, MulAli]

Local alignments : Finding the best alignment for a short sequence in a longer one
Local similarity : Detecting parts of good local alignment
[-> Chapter 2, Netwkg (?)]

1.2. Approximate methods (4 sections)
====

Rationale:
The precise optimal alignment as studied above is too expensive to use
in large data base searches. There are heuristic methods calculating
close-to-optimal alignments that are much faster. They are used most frequently
in practice. Here you will learn how they work, and in chapter 2 you will learn 
how to use them via the network.

Motivation: Computational Costs of optimal pairwise alignment.
Fasta
[-> Chapter 2, Netwkg]
BLAST
[-> Chapter 2, Netwkg]
Advantages & disadvantages

1.3. Weight matrices for scoring alignments (4 sections)
====

Rationale:
The area of application determines the way in which sensible costs
are determined for individual deletions, insertions and replacements.

Nucleotide
  Identity
  Transversions
Protein
  Identity
  PAM
  BLOSM
[-> Chapter 2, Netwkg]

Homework:
========

Exercise the dynamic programming scheme on some hand-sized examples.
Observe the effect of different choices of weights.

(After chapter 2, you will use available tools like Fasta and BLAST
 to calculate alignments and to do similarity search over large data bases.)


WEEK 4-5.
==========
==========

Chapter 2. Networking. ("Netwkg") (20 sections)
==========

Author: David Steffen, steffen@bcm.tmc.edu
Note that the author may not have been consulted about changes made by the 
course coordinator during preparation of this draft master outline !
The original outline can be found 
here.
It is mirrored
here.

General Rationale: In this chapter, you will learn how to use network 
resources needed in the rest of the course. 

2.1. An overview of computer networking (3 sections)
====

Rationale: The purpose of this chapter is to demystify networking.

2.2. How to Approach the Internet (5 sections)
====

Rationale: After studying this subchapter, you will be able to make sense of the diversity of services now available on the Internet and those that will become available in the future. You can then continue to exploit the internet long after the specific resources given in this chapter go out of date. You will have some background knowledge on the history of important tools, note important caveats, and receive an Intro to finding resources. (Little or no practical instruction.)

Term definitions and an introduction to currently available tools

Networking Overview and Caveats

How to Find Things on the Net

2.3. Using the Internet In This Course (10 sections)
====

Rationale: You will learn how to use the network resources needed in the rest of the course, with an emphasis on WWW. The general introduction given up until now shall enable you to learn on your own how to use things like telnet interfaces and mail servers as you need to after the course is over.

Data Banks

Servers offering Manipulation of your Data.

2.4. Other Internet Resources (2 sections)
====

Rationale: There is a lot more available on the 'net. Just take a look !

Homework: 
========

You will be asked to retrieve sequences from Data Banks, and run them
through a Multiple Alignment server.
[-> Chapter 4, MulAli]
You will use one typical email server.
[-> Chapter 1, PrwAli]


WEEK 6 + 7. [depending on Andreas, possibly cut to 1 week ??]
==========
==========

Chapter 3. Mathematical Analysis of Sequence Data. ("MathAn") (20 sections)
==========
Prospective Author: Andreas Dress, jordan@mathematik.uni-bielefeld.de

General Rationale: You will learn to appreciate the specific structural features 
of phylogenetic trees, and the relation between (a) tree-reconstruction 
and (b) compatible-clustering procedures on the one hand, and (a') Darwinian 
phylogenetic studies and (b') Linnean systematics on the other hand.
Using various ways of quantifying similarity via sequence comparisons,
we will finally study several algorithmic procedures for reconstructing 
phylogenetic relationships.

3.1. The Goals of Sequence Classification (2 sections)
====

Rationale: You will get an idea for what Sequence Classification
is useful.

Efficient Bookkeeping in Sequence Data Banks
Structure/Function Prediction from Sequence Homology
Elucidation of Structure/Function Evolution
Elucidation of Species Evolution (Molecular Phylogenetics)
[-> Chapter 2, Netwkg]
[-> Chapter 4, MulAli]

3.2. Concepts of Sequence Similarity (5 sections)
====

Rationale: You will understand different concepts of Sequence Similarity,
how to calculate it, and the usefulness of more sophisticated concepts of
similarity. The strong connection to chapter 1 will be discussed.

Qualitative Similarity Concepts (binary features)
Quantification of Sequence Similarity (Weighting Schemes, Hamming Distances, 
  Correction Methods like Kimura, Jukes-Cantor)
[-> Chapter 1, PrwAli]
[-> Chapter 4, MulAli]

3.3. Cluster Analysis and Phylogenetic Trees. (12 sections).
====

Rationale: You will learn how some popular Clustering / 
Phylogenetic Analysis Methods actually work.

The Goals of Cluster Analysis: Construction of
  a Global Classification Scheme from Local (Dis-)Similarity Data
Average Linkage
The Goals of Phylogenetic Analysis: Inferring Species Evolution
Neighbor Joining
Split Decomposition
Some Relevant Examples
[-> Chapter 2, Netwkg]
[-> Chapter 4, MulAli]

Homework: 
========

For some real-life examples, you will construct phylogenetic trees.
[-> Chapter 2, Netwkg]


WEEK 8+9.
==========
==========

Chapter 4. Multiple Alignment. ("MulAli") (20 sections)
=============================
Author: Georg Fuellen, fuellen@dali.mathematik.uni-bielefeld.de

General Rationale. You will understand why Multiple Alignment is considered
a very difficult problem, you will study approaches that try to 
reduce the number of steps needed to calculate the optimal solution,
and you will study fast heuristics. You will take a look at the 
mutual dependency between Multiple Alignments and Phylogenetic Trees.

4.1. The Complexity of Multiple Alignment.  (2 section)
====

Rationale: You will understand why there are so many possible Alignments, 
and that, unlike pairwise alignment, even an intelligent order of 
computation is not enough to render feasibility.

The number of Alignments of 3 Sequences.
Complexity of the straightforward Dynamic Programming Approach.
[-> Chapter 1, PrwAli]

4.2. How Carillo and Lipman cut down the "Volume of the Dynamic Programming
====
Matrix" (8 sections, diagrams, and maths.)

Rationale: You will understand geometrically the idea of using pairwise
projections for obtaining bounds on the "volume" the dyn. progr. algorithm
needs to explore. Then you will be shown the maths, and correlate
both forms of presentation.

The Dynamic Programming Hypercube.
[-> Chapter 1, PrwAli]
Geometric Explanation of the idea.
Mathematical Derivation of the Bound.
Practical Consequences on the running time.

4.3. Examples, and a few remarks about heuristics. (5 sections.)
====

Rationale: You will understand how the most popular Multiple Alignment 
heuristic works, and how its results compare to optimal solutions for
small examples.

Heuristic Alignment Procedures (i.e. Alignment along a Tree) in Clustal, etc
[-> Chapter 1, PrwAli]
[-> Chapter 2, Netwkg]
[-> Chapter 3, MathAn]
Variation: Progressive Alignment (Doolittle et al.)
Comparison of heuristic and optimal (Carillo-Lipman) alignments.

4.4. One can Formulate the Problem more Naturally by using 'Natural Gap Costs'.
====
(3 sections, diagrams, no maths.)

Rationale: There is a trade-off between simplifying
a real-life problem and not being able to tackle it efficiently.

Natural Gap Costs.
Quasi-Natural Gap Costs.
Consequences of the Cost Definition on the running time.

4.5. A Technique for Lifting Sequences in a Phylogenetic Tree and Concurrently
====
Aligning them.  (2 sections, diagrams, example using Mosaic Forms)

Rationale: The key ideas behind Jiang, Lawler and Wang's approximation
scheme for multiple tree alignment, and a discussion about its practical 
usefulness.

Polynomial-time approximation Schemes.
Lifting sequences in a tree via Dynamic Programming.
[-> Chapter 1, PrwAli]
[-> Chapter 3, MathAn]
Example using very short sequences (Mosaic Form)

Homework: 
========

You will generate and evaluate multiple alignments from real-life sequences.
[-> Chapter 2, Netwkg]


WEEK 10 (+11 ?).
==========
==========

Chapter 5: Protein Folding with Genetic Algorithms: A teaser. ("ProtEn") (15 sections)
==================================================
Author: Steffen Schulze-Kremer, steffen@chemie.fu-berlin.de

General Rationale: You will get an impression on how ideas inspired by 
biological phenomena can be put to use in CS, and how the resulting 
optimization algorithm can then again be used to solve a problem from biology.

5.1. Introduction (4 sections)
====

Rationale: You will understand the principle and benefits of a genetic 
algorithm.  You will be given an overview on Protein Folding and Drug Design, 
and how Drug Design relates to the overall course.

Genetic Algorithm 
Protein Folding and Drug Design
Alignments can assist the drug designer.
[-> Chapter 2, Netwkg]
[-> Chapter 4, MulAli]

5.2. Applying GA to Protein Folding (8 sections)
====

Rationale: You will understand how a widely applicable CS idea (genetic
optimization) can be fitted to a problem from biology. 
You will see/develop an implementation of a genetic algorithm 
for a protein folding simulation. 

Representation Issues and Genetic Operators (3 sections)
Fitness Function (5 sections)

5.3. Results & Discussion (4 sections)
====

Rationale: 
You will be introduced to concepts dealing with the analysis of the 
behavour of a GA.

Search Performance and Prediction Performance + Evaluation of Fitness Function 

Homework (__optional__ programming assignments)
========

Implementation of Basic GA in ANSI-C: Genetic Cycle, Operators
Protein Converter: 3D <--> Torsion Angles
Simple Protein Structure Fitness Functions
statistical and graphical evaluation of results of GA
[-> Chapter 2, Netwkg ???]


Introductory Material for Plenum Discussions / Appendices
=========================================================
=========================================================

    Appx _ Practical Issues. Will there ever be a standard data bank format ?
           Will there ever be standardized software components in Biocomputing ?
           If yes, which programming language shall be used ?

    Appx _ Ethical Issues. Can terrorists use our results to build a "biological
           bomb" ? Who else can ? How many deseases are there that can be
           cured by man-made peptides and proteins ?

    Appx _ Biocomputing in the year 2100 - The big picture.
           Having solved Multiple Alignment and Protein Folding, what's next ?
           Contact Toru Yao, yao@atlas.rc.m-kasei.co.jp.


Back to Biocomputing Course Home Page.
VSNS-BCD Copyright.
Georg Fuellen