Your questions?
Tree of Life movie webpage; Other cool links
First 14 slides from presentation at DIMACS
Powerpoint slides are here
Pairwise alignment
A) DOT PLOT See the Dotlet exercises from last week.
The easiest way to align two sequences is to use a dot plot. In its most straightforward implementation, the two sequences to be aligned are written along the two coordinate axes, and a dot is placed wherever the residue in the row is identical to the residue in the column.
In more realistic implementations a window of 5 to 20 nucleotides or amino acids is slid along one of the axes (i.e., sequences) and compared to every possible window on the other axis (sequence). The dot intensity is adjusted to reflect the percent identity (or similarity) in the two windows.
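The windowed comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, not how Dotlet itself is implemented; the function names (`window_identity`, `dot_plot`, `render`), the default window size, and the threshold are assumptions chosen for the example.

```python
def window_identity(a, b):
    """Fraction of positions at which two equal-length windows are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dot_plot(seq_x, seq_y, window=5):
    """Return a matrix m[i][j] holding the identity between the window that
    starts at position i of seq_y (rows) and position j of seq_x (columns)."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [window_identity(seq_x[j:j + window], seq_y[i:i + window])
         for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix, threshold=0.6):
    """Crude text rendering: '*' where the window identity reaches the
    threshold, '.' elsewhere. A real dot plot would use gray levels."""
    return "\n".join(
        "".join("*" if value >= threshold else "." for value in row)
        for row in matrix
    )


if __name__ == "__main__":
    # Two made-up sequences that share a stretch; it shows up as a diagonal.
    print(render(dot_plot("GATTACAGATTACA", "CCGATTACATT", window=4)))
```

Larger windows suppress random single-residue matches at the cost of blurring short conserved stretches, which is why the window size is left as a parameter.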
Optimal global and local alignments.
There are many different algorithms for calculating pairwise sequence alignments. For two sequences it is "easy" to calculate an optimal global alignment. (According to the motto "It can be easily shown"; see here, and here. The links refer to a bioinformatics course given at the Univ. of Munich.) The so-called Needleman-Wunsch algorithm is widely used. It maximizes a positive alignment score; a related (and under some conditions equivalent) approach is to minimize the differences between two sequences.
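To make the idea of an optimal global alignment concrete, here is a minimal sketch of Needleman-Wunsch dynamic programming. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions for illustration; real programs use substitution matrices and usually separate gap-opening and gap-extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back ONE optimal path; other equally good paths may exist.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The traceback prefers the diagonal move whenever several moves are equally good, so it reports only one of possibly many optimal alignments.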
Multiple Sequence Alignments
CLUSTAL, CLUSTALW and CLUSTALX
Usually global alignments are the easiest to calculate (for local alignments, see below).
One of the easiest to use, most sophisticated, and most versatile alignment programs is clustalw.
Clustalw runs on all common platforms (unix, mac, pc), is part of most multiprogram packages, and is also available via different web interfaces (for example here, here, and here). (Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244; Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680.)
Clustalw uses a very simple menu-driven interface, and you can also run it entirely from the command line (i.e., it is easy to incorporate into scripts for repeated analyses).
Clustalx uses the same algorithms as clustalw. However, it has a much nicer interface, it displays information on the level of similarity, and it uses color in the alignment. Especially for amino acids, the use of color greatly enhances the ability to recognize conservative replacements. Clustalx is available for different platforms at the EBI's ftp site (follow your platform; clustalx is stored in the clustalw folders).
Clustal reads and writes most formats used by different programs. The easiest format is the FASTA format:
>name of sequence or any other information goes in the first line. This line starts with ">". The line can be longer than 80 characters. The first line ends with the first paragraph sign (line break).
The second line contains the sequence itself; numbers and other non-standard characters are ignored. Be careful if you download sequences. Often the transfer programs introduce paragraph signs every 100 characters, and the end of a command line frequently ends up as the beginning of the sequence.
All sequences to be read should be in a single file.
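The FASTA rules described above are simple enough to check with a short script before handing a file to an alignment program. The sketch below is an illustration only, not the reader used by clustal; the function name `parse_fasta` and the decision to keep letters plus the gap character "-" are assumptions.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (name, sequence) tuples.

    The header is everything after ">" up to the first line break. Sequence
    lines are concatenated; digits, spaces and other non-letter characters
    are dropped, except "-" which is kept so aligned files survive.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append("".join(c for c in line if c.isalpha() or c == "-"))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 made-up example\nACGT 10\nACGT 20\n>seq2\nAC-GT\n"
    for header, sequence in parse_fasta(example):
        print(header, len(sequence), sequence)
```

Printing the name and length of every record is a quick way to spot the download problems mentioned above, such as a stray command line that ended up inside a sequence.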
(sample clustalw input file)
(sample clustalw output file)
Clustal also reads aligned sequences. If you input aligned sequences you can go directly to the tree section.
!! Be careful: if you make a mistake and the sequences are not aligned, your tree will look strange !!
!!! ALWAYS CHECK YOUR ALIGNMENT!!!
Clustal is also useful for reformatting and editing alignments. It is very forgiving in reading formats; e.g., you can open a clustal-format file (*.aln) in a text editor, delete columns, reload the file into clustalw, and write it out in any of the other available formats.
For calculating an alignment, you can select different substitution matrices and gap penalties (end gaps can be treated differently!).
Clustal is better than its reputation. It does a great job of handling gaps, especially terminal gaps, and it makes good use of different substitution matrices.
To align sequences clustal performs the following steps:
1) Pairwise distance calculation
2) Clustering analysis of the sequences
3) Iterated alignment of two most similar sequences or groups of sequences.
It is important to realize that the second step is the most critical one. The relationships found here strongly bias the final alignment: the better your guide tree, the better your final alignment. You can load a guide tree into clustal; this tree will then be used instead of the neighbor-joining tree that clustalw calculates by default. (The guide tree needs to be in normal parenthesis notation WITH branch lengths.)
NOTE that clustalw and other multiple sequence alignment programs do NOT necessarily find an alignment that is optimal by any given criterion. Even if an alignment is optimal (as in the Needleman-Wunsch algorithm), it usually is not UNIQUE. It often is a good idea to take different extreme pathways through the alignment matrix, or to use a program like tcoffee that uses many different alignment programs to build a consensus.
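Steps 1 and 2 can be illustrated with a toy guide-tree builder that writes parenthesis notation with branch lengths. Two simplifications are assumed here and differ from what clustalw does: distances are plain fractions of differing sites between equal-length sequences (clustalw derives them from pairwise alignments), and clustering uses UPGMA rather than clustalw's default neighbor joining. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_guide_tree(names, seqs):
    """Cluster sequences by UPGMA; return a parenthesis-notation tree string."""
    # cluster id -> (tree string, number of members, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): p_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        height = dist.pop(pair) / 2
        tree_i, size_i, h_i = clusters.pop(i)
        tree_j, size_j, h_j = clusters.pop(j)
        tree = f"({tree_i}:{height - h_i:.3f},{tree_j}:{height - h_j:.3f})"
        # Distance to every remaining cluster: size-weighted average.
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((next_id, k))] = (
                (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
            )
        clusters[next_id] = (tree, size_i + size_j, height)
        next_id += 1
    tree, _, _ = next(iter(clusters.values()))
    return tree + ";"


if __name__ == "__main__":
    print(upgma_guide_tree(["A", "B", "C"], ["AAAA", "AAAT", "TTTT"]))
```

In a progressive alignment the order of joins in such a tree dictates which sequences or groups are aligned first, which is why errors in the guide tree propagate into the final alignment.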
An alternative to clustalw that has recently become very popular is MUSCLE, especially for very large datasets or very divergent sequences. Like clustalw, it is run from the command line and runs on all common platforms. It can also generate profile alignments: if you want to align two homologous protein families, you first align each of them separately and then align the two alignments with one another. You can use the same approach to add an additional sequence to an already existing alignment. The MUSCLE homepage is here.
Below is a session with muscle in which separate alignments are first created for sequences of the V-ATPase and F-ATPase catalytic subunits, and the resulting alignments are then aligned with one another (at each step the alignments are also "refined"):
muscle -in VatpA.fa -out VatpA.afa
muscle -in VatpA.afa -out VatpA.rafa -refine
muscle -in beta.fa -out beta.afa
muscle -in beta.afa -out beta.rafa -refine
muscle -profile -in1 beta.rafa -in2 VatpA.rafa -out Abeta.afa
muscle -refine -in Abeta.afa -out Abeta.rafa
The files used are here and here; the result is here.
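Because muscle runs from the command line, the session above is easy to script. The sketch below only assembles the same six command lines for any pair of input files; the helper names are hypothetical, and the flags are simply those shown in the session (other MUSCLE releases may use different options, so check the documentation of your version).

```python
import subprocess


def stem(path):
    """File name without its last extension, e.g. 'beta.fa' -> 'beta'."""
    return path.rsplit(".", 1)[0]


def muscle_profile_commands(fasta_1, fasta_2, combined, muscle="muscle"):
    """Build the align / refine / profile / refine command lines.

    fasta_1 and fasta_2 are unaligned FASTA files; combined is the stem
    used for the merged alignment. Nothing is executed here.
    """
    commands = []
    for fasta in (fasta_1, fasta_2):
        s = stem(fasta)
        commands.append([muscle, "-in", fasta, "-out", f"{s}.afa"])
        commands.append([muscle, "-in", f"{s}.afa", "-out", f"{s}.rafa",
                         "-refine"])
    commands.append([muscle, "-profile",
                     "-in1", f"{stem(fasta_1)}.rafa",
                     "-in2", f"{stem(fasta_2)}.rafa",
                     "-out", f"{combined}.afa"])
    commands.append([muscle, "-refine", "-in", f"{combined}.afa",
                     "-out", f"{combined}.rafa"])
    return commands


def run_all(commands):
    """Run each command in order; stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this works
    # even where muscle is not installed.
    for command in muscle_profile_commands("beta.fa", "VatpA.fa", "Abeta"):
        print(" ".join(command))
```

Calling `run_all` on the returned list would execute the workflow, assuming a muscle executable that accepts these flags is on the search path.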
Other programs often used for global multiple sequence alignments
(We will not use these programs in this course; if you are already confused by the information provided, skip this section):
A program available via the www is SAM (sequence alignment and modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, & Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned or not aligned) in FASTA format. The program uses secondary structure predictions, neighboring sites, etc. to place gaps. The program can be accessed with a web browser at "http://www.cse.ucsc.edu/research/compbio/sam.html".
If your sequences are not very similar, and you are not able to generate a trustworthy multiple sequence alignment, you can calculate distance trees based on pairwise alignments only. The best program for this purpose is statalign from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162). It runs under standard UNIX. It's only worth your effort if you are getting gray hairs because of a dataset you cannot reliably align.
PILEUP in the GCG package generates alignments that are very similar to those from clustalw. The TREE programs in GCG are currently considered by many to be worthless (UPGMA). It has been planned for over six years to incorporate PAUP into GCG in the "near future".
TCOFFEE extracts reliably aligned positions from several multiple or pairwise sequence alignments. It requires more thought and attention from the user than clustalw, but it helps to focus further analyses on those sites that are reliably aligned. A description is here, a web interface is here (note the book advertisement at the bottom of the page).
Local Alignments (e.g. MACAW) search the sequences for motifs that occur in different sequences. In macaw the user can choose among different tools to search for matching motifs, can select subsets of sequences or positions to search for similar motifs, and has to accept or reject each of the motifs found.
Alignments by Eye:
On PCs there is a DOS program called the Eyeball Sequence Editor (ESEE) that lets you align nucleotide sequences and the encoded protein sequences simultaneously. It needs some getting used to.
One useful sequence editor is seaview, the companion sequence editor to phylo_win. It runs on PC and most unix flavors, and is the easiest way to get alignments into phylo_win.
Top down approaches (fossil and molecular records, retrodiction of biochemical pathways)
Bottom up (prebiotic chemistry)
Primordial Soup (Miller -> see reading assignment) or Primordial Pizza (Wächtershäuser -> see reading assignments)
The currently favored scientific scenario for the transition from chemistry to biology is roughly as follows:
| prebiotic chemistry either on Earth or in Space, in solution or on surfaces or in the gas phase |
| (autocatalytic chemical cycles and chemical networks) |
| ? |
| self-replicating biopolymer |
| ? |
| Emergence of cells, hypercycles or other means to co-select different genes |
| RNA world |
| ?? |
| Invention of protein biosynthesis |
The existence of the RNA world as a transitory stage is supported by the following:
In vitro evolution has succeeded in evolving RNAs with novel properties, e.g. ATP binding. Jack Szostak's lab is working to evolve RNAs with template-directed RNA polymerization capabilities. The principal selection scheme is depicted in this diagram at Szostak's web page.
In vitro selection became famous with Sol Spiegelman's experiments on the in vitro replication of the phage Qbeta RNA. In this case selection was for the fastest replicating molecules: they became shorter and lost their ability to infect bacteria.
Later inventions are the SELEX procedure to select for RNA with very specific binding properties (see left), and the selection of ribozymes with altered or new properties. In the latter case growth and selection can be either discrete or continuous. See reading materials for further discussion.
LINKS to sites that report on interesting RNAs and on the RNA world:
An article by Leslie E. Orgel (one of the main proponents of the RNA world) on The Origin of Life on Earth (with a nice picture depicting the origin and early evolution of life)
An overview on the origin of the RNA-world idea (with nice links to CV of Sidney Altman, one of the scientists who discovered catalytic RNAs)
A step by step review of the Tuerk and Gold Science article on SELEX
All you wanted to know about SELEX (but were afraid to ask...)
The Jack Szostak Lab Home Page
tmRNAs (an RNA that helps target for degradation the incomplete proteins that result from the translation of broken mRNAs)
RNA's Role At Beginning Of Life On Earth (with a nice picture of Harry Noller working in the lab) [requires free registration to view]