Networking

Interim Version 2.5
June 1999

David Steffen, Ph.D.
President, Biomedical Computing, Inc.
6626 Westchester
Houston, Texas 77005
USA

Introduction

The Internet has become an important tool for biological and biomedical research scientists. Using the Internet, it is possible to perform many kinds of analyses on research data and to search for and obtain information. Over the last several years, the number of tools and the amount of information relevant to biologists available on the Internet have grown, and the ease of use of these tools has grown as well. As a result of both of these trends, the value of Internet resources for biologists now significantly outweighs the costs in time and money of using them. The overall goal of this chapter is to help biologists use the Internet effectively and to illustrate to computer scientists how biologists are currently using the Internet.

This chapter has three specific goals:
  1. To provide background information which will help demystify computer network usage.
  2. To provide an introduction to the resources available to biologists over the Internet in sufficient detail to allow the students in this course[1] to explore and learn how to use these resources on their own.
  3. To provide practical instruction to these students on using the specific network resources needed during the remainder of the course (but read the Release Notes).

It is assumed that the students already can use a Web Browser (e.g. Internet Explorer or Netscape Navigator) to access the course.






An overview of computer networking

Presumably, you are reading this description of computer networking over the Internet and thus do not need instruction on how to connect to the Internet. If this assumption is incorrect, this section will not help you, nor will the rest of this chapter. The purpose of this section is to provide a very brief and informal theoretical description of computer networks, internets, and THE Internet to help demystify it for the person already using it.

I have no idea what kind of computer you are using or what kind of network connection it has. This is as it should be; I don't need to know. The current state of the Internet is such that connections between computers all over the world function seamlessly and easily, requiring little or no understanding of them by their users. In fact, however, the seamless connection that allows me to effortlessly retrieve documents from an http server or to chat on BioMOO is actually quite complex "under the hood".

There are many different kinds of networks on which your computer might reside. (The computer on which you are reading this might not even be directly connected to a network, but rather might be connected as a terminal to a computer which is on a network.) These networks can vary both in terms of their physical and electrical properties (for example, RS485 or Ethernet) and in terms of how data is encoded on these media. For example, Ethernet can carry data encoded as TCP/IP, AppleTalk, or Novell NetWare. Similarly, AppleTalk can be carried over Ethernet or an RS485 network (which Apple calls LocalTalk). Most likely, the network to which your computer is connected is a Local Area Network (LAN) as opposed to a Wide Area Network (WAN). LANs interconnect a limited number of computers within a limited area. For example, the Sun computer to which I am connected is directly connected to a few tens of computers in the Molecular Biology Computing Resource and the Department of Cell Biology at Baylor. Most of the computers at Baylor are on different networks. Connecting to these other computers at Baylor and to computers all over the world requires interconnections between LANs, which are usually accomplished with a WAN. WAN connections might be made via T1 lines or ISDN connections, for example.

Connections between LANs are accomplished by specialized pieces of hardware generically called gateways. (Bridges and routers are specific classes of gateways. A bridge just passes packets of information from one network to another whereas a router examines the address in each packet and intelligently routes the packet to the correct network.) Gateways are special purpose computers whose jobs might include determining which packets of information to transmit from one LAN to another, or reformatting packets of data as required by differences between the LANs. A collection of networks interconnected by gateways is referred to as an internet. One internet has grown to include many computers around the world and, in honor of its dominant role in worldwide computing, is referred to as THE Internet. Commonly, internet (with a lower case i) refers generically to any interconnected set of networks, and Internet (with an upper case I) refers to THE Internet.

For two computers on a network or on an internet to communicate with each other, they need to have a unique way of referring to each other. The Internet uses a protocol named IP to accomplish this. The IP protocol assigns an address to each computer on the Internet. These addresses have the form of four numbers, each number having a value between 0 and 255. An example of such an address is:

129.106.28.111

IP addresses are hierarchical; all of the addresses of the form 129.106.28.### might be at one institution, or within one department at that institution, for example. Gateways on the Internet contain maps of how the Internet's constituent networks are connected one to another, so that given such an address, they can determine one or more routes from where they are on the network to the appropriate destination, and thus which gateway(s) to hand any given packet off to.

In theory, this IP addressing scheme should allow networking of over 4 billion computers. At the time the IP addressing system was implemented, this seemed effectively infinite. However, more and more computers are being added to the Internet, and more importantly, because of the way these addresses have been assigned, many fewer than this number of addresses can actually be used. In order to increase the number of available addresses and to be able to add features like security and multimedia support to IP networking, a new version of IP, named IPv6, is being developed. As of this writing, it still is in the experimental phase, but you can expect to hear more and more about IPv6 in the future.
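For readers who like to see such things concretely, the addressing scheme described above can be sketched in a few lines of Python using the standard ipaddress module. This is only an illustration: the choice of a network covering all addresses of the form 129.106.28.### (written 129.106.28.0/24) is my assumption, made to match the hypothetical example in the text, and says nothing about how that address range is really assigned.

```python
import ipaddress

# The example address from the text, parsed with the standard library.
addr = ipaddress.IPv4Address("129.106.28.111")

# Each of the four numbers fits in one byte, so the whole address is
# really a single 32-bit integer written in a human-friendly way.
octets = [int(part) for part in str(addr).split(".")]
as_integer = int(addr)

# Assumption for illustration: the first three numbers identify one network.
network = ipaddress.ip_network("129.106.28.0/24")

print(octets)                  # the four numbers, each between 0 and 255
print(as_integer)              # the same address as one integer
print(addr in network)         # True: the address is of the form 129.106.28.###
print(network.num_addresses)   # 256 addresses share that prefix
print(2 ** 32)                 # 4294967296, the "over 4 billion" in the text
```

A gateway makes essentially the same membership test when it decides where to hand a packet next, although real routing tables are considerably more elaborate than this.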

One can (and sometimes does) use numeric addresses such as the above. More commonly, however, one uses an address consisting of words, such as:

merlin.bcm.tmc.edu

This form of address is converted to the numeric form of the address either by application software on your local computer, or more commonly software called a "nameserver" which typically runs on a server to which your computer is connected.

The advantages of using the "name" form of addresses are first, they are more user friendly (easier to remember and understand for humans) and second, they make for more reliable connections. Sometimes it is necessary to change the numeric address of a computer, or to move services from one computer to another, and when this happens, connections to the old numeric address will no longer work. However, nameservers can be automatically updated to associate the old name with a new numeric address, so that connections to the name will continue to work.
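The second advantage can be illustrated with a toy model in Python. This is a sketch, not a description of how real nameservers are implemented: the table, the resolve function, and both numeric addresses paired with the name are made up for the example. (In a real Python program, the standard library call socket.gethostbyname performs an actual lookup.)

```python
# A toy "nameserver": a table mapping names to numeric addresses.
# Both numeric addresses used here are invented for illustration.
name_table = {"merlin.bcm.tmc.edu": "129.106.28.111"}

def resolve(name):
    """Return the numeric address currently registered for a name."""
    return name_table[name]

before = resolve("merlin.bcm.tmc.edu")

# The service moves to a different machine. Only the table is updated;
# users keep typing exactly the same name.
name_table["merlin.bcm.tmc.edu"] = "129.106.28.42"
after = resolve("merlin.bcm.tmc.edu")

print(before, "->", after)
```

Anyone who had written down the old numeric address would now be connecting to the wrong machine (or to nothing), whereas anyone using the name never notices the change.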

To connect to a remote computer via the Internet you need permission from that computer to connect and you need to specify a kind of connection; for example telnet, gopher, http, ftp, or email. These different kinds of connections are characterized by different capabilities, different protocols for communication, different client software (that which the connecting user uses) and different server software (that which the host computer uses). Permission to use a computer is controlled on a "service by service" as well as a "user by user" basis. For example, anyone may make an http connection to merlin.bcm.tmc.edu, only those users with an account may make a telnet connection, and nobody may make a gopher connection. The way that you specify what kind of connection you want is by specifying a "port". This port is not physical, but rather can be thought of as a sub-address.

On merlin.bcm.tmc.edu, Port 23 is connected to a telnet server and Port 8001 is connected to an http server. There are standard ports for different services, port 80 for http and port 23 for telnet, for example, but these standards are just a convenience; any service can be connected to any port. The convenience of using the standard port is that users will know how to connect without being told. In fact, this frequently is where a connection will go by default. For example, most http (web) clients can connect to any port, but will connect to port 80 if no port is specified.
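The default-port behavior just described can be sketched in Python. The function name target_port and the small table of defaults are my own inventions for this example; urlsplit is part of the Python standard library.

```python
from urllib.parse import urlsplit

# Standard ports for the two services named in the text.
DEFAULT_PORTS = {"http": 80, "telnet": 23}

def target_port(url):
    """Return the explicit port in a URL, or the standard port for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(target_port("http://merlin.bcm.tmc.edu:8001/"))  # 8001
print(target_port("http://merlin.bcm.tmc.edu/"))       # 80
```

This is roughly the decision a Web client makes before it opens a connection: use the port the user gave, otherwise fall back on the standard one.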




How Different Internet Services are Used


As noted above, different Internet services are characterized by client software, server software, a set of capabilities they agree upon (e.g. text, pictures in the GIF format, etc.), a protocol by which they communicate (i.e. how the data is encoded in a stream of bytes), and a port on which the client contacts the server to begin communication. The distinction between different services can be blurred because different services can perform similar functions, because different services can share capabilities, and because of the existence of multifunction clients (and servers). Specifically, many modern Web clients are highly multifunctional, being gopher, ftp, email and net news clients in addition to being Web clients. Finally, note that although this chapter deals with services delivered via the Internet, some of these same services can be delivered via other, very different kinds of networks, internets, or dedicated connections.

What is presented here is a very superficial overview of a few of the available Internet services. The interested reader is referred to Connected: An Internet Encyclopedia, an excellent reference on the Internet provided by Brent Baccala of The Free Software Foundation.

telnet

Telnet is one of the oldest of the network services and perhaps the easiest to understand. Telnet allows one computer to "log on" to another computer as if it were a terminal. Once logged on, you frequently will have all the privileges of a local user; you can run programs, create and delete files. This is probably the most common way that users with accounts will use a computer.

Although "full service logins" as described above are perhaps the most common use of the telnet protocol, the host's system administrator may impose as much control as desired on a telnet connection. Thus, a telnet service may be advertised with a public login name and password. Login with this name, however, is likely to be restricted to a limited number of commands. The National Institutes of Health in the United States at one point used such a telnet login to disseminate information about the membership of study sections. Such specialized telnet services have become much less common since the rise in popularity of the Web.

A telnet session can negotiate a range of different protocols, but this almost always includes ASCII [2] text. Because many protocols for other services (e.g. SMTP, HTTP) are encoded as ASCII text, a telnet client can sometimes be used to connect to a server for these other protocols. Most people will use a telnet client the first time connecting to a MOO, and some people will continue to use telnet as their client, although most of us find dedicated clients to be significantly more convenient. Similarly, it is possible to connect to a Web server with a telnet client if you understand the syntax of HTTP. This is almost never done to use a Web server, but is occasionally done when debugging.
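To make the point about ASCII-encoded protocols concrete, here is a minimal Python sketch of what one would type into a telnet client to retrieve a page from a Web server. The function names are hypothetical, and the request is the simplest form I know of (an HTTP/1.0 GET with a Host line); real Web clients send a good deal more.

```python
import socket

def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.0 GET request as plain ASCII text."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")

def fetch_raw(host, port, path="/"):
    """Send the request over a bare TCP connection and return the raw reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Show the request itself; nothing is sent over the network here.
print(build_get_request("merlin.bcm.tmc.edu").decode("ascii"))
```

Calling fetch_raw with a host and port does over a socket what a person debugging a server does by hand with telnet: it sends those few lines of text and reads back whatever the server says.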

From a practical point of view, every telnet host will be different, and thus you will need to learn about each one as you have occasion to use it.

ftp

Telnet is useful for interactive computer access, but is much less useful for transferring files. Ftp is an older service designed specifically for file transfer. Originally it, like telnet, was intended for account owners. However, as it became apparent that it was useful to make files available to the world at large without giving all those wanting the files an account, the variant of "anonymous ftp" developed. In this variant, logging in with a "magic" user name (most commonly "anonymous" or "ftp") eliminates the requirement for a password.

In 1996 I wrote, "To a large extent, use of the World Wide Web has rendered (direct) ftp access obsolete." Although there was and is some truth to that statement (especially given that files on an ftp server can be retrieved by a Web client), the need for ftp clients has not vanished. Some users will choose to avoid them, preferring the simplicity of dealing with a single piece of software. Within their domain, however, ftp clients are more versatile than Web browsers: in some cases one has more control with an ftp client, and for simple file transfer they are quicker and more convenient.

Once logged on via ftp, access to the host filesystem is accomplished by a series of commands. On a UNIX ftp client, the commands are UNIX-like: cd to Change Directory and ls to LiSt the files in that directory. To transfer files, you either get [3] a file from the host computer or put a file onto it (where allowed). These commands do not depend on the host computer running UNIX! These are ftp commands, some of which happen to be similar to UNIX commands. A client may choose to hide these commands; a client with a graphical user interface (GUI), for example, might not have typed commands at all, but buttons.
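The same sequence of commands can be issued from a program rather than typed. The sketch below uses Python's standard ftplib module; the function names fetch_file and anonymous_session are my own, and the host, directory, and file names are whatever the caller supplies, so nothing here refers to a real server.

```python
from ftplib import FTP

def fetch_file(ftp, directory, filename, destination):
    """Change directory, list it, and download one file unmodified."""
    ftp.cwd(directory)                  # like typing: cd directory
    names = ftp.nlst()                  # like typing: ls
    if filename not in names:
        raise FileNotFoundError(filename)
    with open(destination, "wb") as out:
        # like typing: get filename (transferred byte for byte)
        ftp.retrbinary(f"RETR {filename}", out.write)
    return names

def anonymous_session(host):
    """Open an anonymous ftp connection to a host chosen by the caller."""
    ftp = FTP(host)
    ftp.login()    # with no arguments, ftplib logs in as "anonymous"
    return ftp
```

A typical use would be to call anonymous_session with the name of an ftp server, pass the result to fetch_file, and then call the connection's quit method when finished.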

One pair of ftp commands that is especially important to understand is binary and ascii. Ftp transfers occur in ascii (text) mode by default. In ascii mode, the file received may not be identical to the one on the host, as ftp may make changes in the file during transfer, to allow for differences in how different operating systems handle text. For example, UNIX terminates lines with the linefeed character (ASCII 10 decimal), the Macintosh operating system with a carriage return (ASCII 13 decimal), and MS-DOS uses one of each. These differences are corrected for during an ascii transfer. This is highly desirable for text files, but catastrophic for binary files like program object code and pictures. Thus, before getting such a file, it is important to issue the binary command. This instructs ftp to transfer files unmodified.
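Why an ascii transfer is catastrophic for a binary file is easy to demonstrate. The Python sketch below does not use ftp at all; it simply mimics the line-ending rewrite that an ascii transfer from a UNIX host to an MS-DOS client would perform. The function name and the sample bytes are invented for the example.

```python
def ascii_transfer_unix_to_dos(data):
    """Mimic an ascii-mode transfer: rewrite UNIX line ends as MS-DOS ones."""
    return data.replace(b"\n", b"\r\n")

text_file = b"line one\nline two\n"

# Arbitrary bytes standing in for a picture or program; the value 0x0A
# happens to be the same byte as the linefeed character.
binary_file = bytes([0x89, 0x0A, 0x00, 0x0A, 0xFF])

# For text, the rewrite is exactly what the receiving system wants.
print(ascii_transfer_unix_to_dos(text_file))

# For binary data, every 0x0A byte silently gains a companion 0x0D.
damaged = ascii_transfer_unix_to_dos(binary_file)
print(len(binary_file), len(damaged))   # 5 7: two bytes were inserted
```

The damaged file is longer than the original and no longer means what it did, which is why the binary command matters.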

email

Both ftp and telnet are interactive, more or less real time programs. Sometimes it is useful, however, to communicate with another computer, or more commonly, a user on another computer, by leaving them a message which they can read and respond to at their convenience. This is done over the Internet by using email.

Email is a generic term for a variety of processes which can use different protocols and network technology, and which, in many cases, use a more complex client/server model than many of the other protocols discussed in this chapter. At present, most email is transmitted by SMTP (Simple Mail Transfer Protocol) via TCP/IP over the Internet. SMTP transmits email on port 25 between two dedicated, full time servers. The assumption is that both SMTP servers will be generally available. Should the receiving server not be reachable when the transmitting server needs to send email, the email message will be held and the transmission will be retried several times over a period of days, until a successful transmission occurs or until the maximum retry time has been exceeded, at which point an error message will be returned to the sender. Importantly, should the initial attempts at sending email fail, the sending server will frequently email a warning to the user while continuing to attempt delivery. Such warnings can be safely ignored.
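The hold-and-retry behavior described above can be sketched as a small simulation in Python. Every number in it is hypothetical: I have assumed a retry every hour, a warning after four hours, and giving up after five days (120 hours). Real mail servers let the administrator choose these values.

```python
def delivery_timeline(reachable_at, retry_every=1, warn_after=4, give_up_after=120):
    """Simulate a sending server's queue. All times are in hours and hypothetical."""
    events = []
    warned = False
    hour = 0
    while hour <= give_up_after:
        if hour >= reachable_at:
            # The receiving server answered, so the message goes through.
            events.append((hour, "delivered"))
            return events
        if hour >= warn_after and not warned:
            events.append((hour, "warning mailed to sender, still retrying"))
            warned = True
        hour += retry_every
    events.append((give_up_after, "gave up, error returned to sender"))
    return events

print(delivery_timeline(reachable_at=6))      # a warning, then delivery
print(delivery_timeline(reachable_at=1000))   # a warning, then failure
```

In the first case the sender receives a warning and the message is nonetheless delivered two hours later, which is why such warnings can be ignored; only the final error message means the mail did not arrive.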

The SMTP programs discussed above are typically symmetrical (i.e. the same program can serve as either client or server), and are complex. Typically, you will not interact with these programs directly. Rather, dedicated client software is used to compose, send, receive, and read email, and it is that software which communicates with the SMTP server. If you send and receive email via a computer that is always on and always connected to a network reachable by your mail server (e.g. a Unix workstation), then incoming mail is saved to a mail spool file on your computer, from which your client software retrieves it, and outgoing email is passed to the SMTP server. Examples of client software running on Unix workstations are mail, mailx, mush, elm, mutt, and pine. Also, as is discussed below, web browsers sometimes can be used as email clients.

If you send and receive email via a computer that is not always on and/or not always connected to the network (e.g. a Mac or a Windows computer), sending email proceeds as above, but receiving email is different in that the SMTP server cannot necessarily get incoming email onto your computer's file system. In that case, a different protocol is used, most commonly POP3 (aka POP). (IMAP is a newer protocol for accomplishing the same task about which you may hear more in the future.) The SMTP server stores your email on a remote host and your local client retrieves it from a POP3 server when you check for mail. Typically, a POP3 account will be provided by whoever provides your Internet access. Thus, to install an email client on a Mac or Windows computer, you typically have to provide the domain name and/or IP address of the SMTP and POP3 servers (frequently the same) and the user name and password for the POP3 account.

The use of email has been expanded in a number of ways. One of the simplest and earliest was to extend it to automatically deal with groups of readers. This is accomplished by having mail delivered to an address which corresponds to a program rather than a user. That program in turn resends that message to all the members of the group of readers. Another email address, corresponding to a different but related program, can allow users to issue commands, e.g. to add or remove themselves from the group. Two of the most common of such software packages are listserv and Majordomo. Majordomo is the software used in this course. Although you do not need a sophisticated knowledge of Majordomo to participate in the course, you should at least learn how to unsubscribe from lists when you are no longer interested in them.
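The two-address arrangement described above can be sketched as a toy Python class. This is emphatically not Majordomo: the class name, the method names, and the one-word commands are invented for illustration, and real list servers have their own (more elaborate) command syntax.

```python
class MailingList:
    """Toy list server: one address takes commands, another takes postings."""

    def __init__(self):
        self.members = set()

    def command(self, sender, body):
        """Handle a message sent to the (hypothetical) command address."""
        word = body.strip().lower()
        if word == "subscribe":
            self.members.add(sender)
        elif word == "unsubscribe":
            self.members.discard(sender)
        else:
            return f"unrecognized command: {body!r}"
        return f"ok: {word}"

    def post(self, sender, body):
        """Handle a message sent to the list address: resend it to every member.

        The sender is ignored in this sketch; a real list server may check it.
        """
        return [(member, body) for member in sorted(self.members)]

course_list = MailingList()
course_list.command("alice@example.edu", "subscribe")
course_list.command("bob@example.edu", "subscribe")
print(course_list.post("alice@example.edu", "Homework 1 is posted"))
course_list.command("alice@example.edu", "unsubscribe")
print(course_list.post("bob@example.edu", "Is anyone still here?"))
```

Note that subscribing and unsubscribing are themselves just email messages, sent to the program rather than to the group.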

List server software has a number of problems. First, remembering the commands and email addresses (one for each group to which you belong and one for the listserver to issue commands) is difficult. Second, these list servers are completely dependent upon and very picky about email addresses. At some universities, a user's email address is different depending on how they log onto the system, and these addresses change with some regularity. This will usually not introduce problems for receiving mail from a listserver, as your system will probably resolve these changing names automagically, but it produces recurring problems sending mail to a listserver, as the listserver may require that your posting come from precisely the same address as is present in the subscription list. Third, because listservers use the email system, messages from a listserver group are intermixed with your private email and with messages from all the other listservers you subscribe to. Fourth, the email program you are likely to be using to read the intermixed mess of messages lacks many commands, extremely useful for efficiently following a group, that even the most primitive Usenet software will have. It might be thought that Usenet software would make listservers obsolete in the same way that the Web ought to make ftp obsolete. Why this has not occurred will be discussed in the next section.

A similar email extension uses "mailservers"; programs which receive email and automatically generate a response. (Most list servers have a limited form of this capability.) Mailservers can provide services that one might expect to perform using ftp or the Web. The way this is done is that email is sent to a program on the host computer rather than to a user, and this program responds to information in the Subject or Body of the email message. For example, it is possible to retrieve sequences from GenBank or to use BLAST to search GenBank using mailservers. Mailservers have some advantages that have mandated their continued use. Because they are mail based, they are asynchronous. A user can make a request at their convenience, and then go on to other tasks while waiting for their request to be fulfilled. On the server end, requests can be queued to be filled as the host machine has the resources to do so. With the growth of the Web, hybrid servers have appeared where you request a document via the Web but where the document is delivered to you via email. This is useful if the document is large and/or if its generation takes a long time. This is, for example, an option on NCBI's BLAST server.

The biggest disadvantage of mailservers is that communication with them requires a very precise syntax in the email message. Further, there are no standards for this syntax and thus each different mailserver has a different syntax for us to learn. Avoiding this is an enormous advantage of the hybrid Web/email servers described above. As of this writing, however, the homework requires that you learn at least a basic subset of the commands for the retrieve server.

Basic email is a text-only service. This would limit the usefulness of mailservers; it would prevent them from returning pictures or programs, for example. Originally, this problem was solved by encoding such binary files as ASCII text and placing that text in the body of the email message. On a Unix system, this could have used the program uuencode (and the companion program uudecode to reverse the process). It was then the responsibility of the user to cleanly remove the encoded material from the email message and to use decoding software to convert it back to a binary file.

More recently, email was formally extended to handle "complex" messages using the MIME (Multipurpose Internet Mail Extensions) protocol. Using a MIME-compatible mail client, one can "attach" arbitrary files to an email message, and with luck, the person receiving these email messages will have a client that understands the MIME attachment you sent and their client will handle the attachment appropriately; most commonly by saving it as a copy of the file to their filesystem. Unfortunately, this process frequently fails.

In the simplest case, the recipient has an old email client that doesn't understand MIME at all. This is relatively easy to deal with in that this simply defaults to the "old" system; the "attached" file is present as a clearly delimited, encoded block of text (which looks like gibberish) in the body of the email message; a knowledgeable user can infer the nature of the contents and the method of encoding from the email message and (given the correct software) recover the contents. More problematic are transfers that "almost" work; where the recipient has a client that understands MIME, but where the sender uses a MIME type or an encoding method that the recipient's client doesn't understand. In this case the attachment can be garbled or lost altogether. Incompatible encoding methods are, in my experience, a frequent problem. Common encoding methods include uuencode, binhex, base64, with some mail clients using proprietary methods. The only solution to this problem is to pre-negotiate formats between sender and recipient.
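To show what "encoding a binary file as ASCII text" looks like in practice, here is a short Python sketch using the standard library. It encodes the same few bytes with two of the methods named above (base64 and the uuencode line format) and then builds a MIME message with an attachment. The payload, the subject line, and the file name example.bin are all made up for the example.

```python
import base64
import binascii
from email.message import EmailMessage

payload = bytes(range(16))      # stand-in for a small binary file

# Two different ASCII encodings of the same bytes.
b64_text = base64.b64encode(payload).decode("ascii")
uu_text = binascii.b2a_uu(payload).decode("ascii")   # one uuencoded line
print(b64_text)
print(uu_text)

# Each decoder recovers the original bytes, but only from its own format.
assert base64.b64decode(b64_text) == payload
assert binascii.a2b_uu(uu_text) == payload

# A MIME message records the type and the encoding alongside the data,
# so that a MIME-aware client can undo the encoding without human help.
msg = EmailMessage()
msg["Subject"] = "example attachment"
msg.set_content("The file is attached.")
msg.add_attachment(payload, maintype="application",
                   subtype="octet-stream", filename="example.bin")

attachment = next(msg.iter_attachments())
print(attachment.get_content_type())
print(attachment["Content-Transfer-Encoding"])
```

The two encoded strings look nothing alike, which is the heart of the compatibility problem: a recipient whose software expects one method will see only gibberish if the sender used another.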

Finally, in an attempt to allow for formatting (boldface, italics, colored text, etc.) of email messages, a number of "rich text" extensions have developed. For example, if you use your Web client as an email client, it may allow you to format your messages using HTML. These are completely non-standard and will only work if sender and recipient have clients which agree on a particular extension.

Usenet

The alternative to listservers for group communication is NNTP (Network News Transfer Protocol). Net News uses entirely different protocols and software than email (and thus listservers). The distinction between listservers and Net News is made less apparent, however, by the fact that one can typically send email from within a Net News client to the author of a Net News message, and because as a result one might receive email in response to a message posted to a newsgroup. To read messages posted to Net News, one runs any one of many newsgroup client programs, subscribes or unsubscribes to groups, and reads the messages one group at a time. As with so many other protocols, many modern Web browsers also have Net News client functionality, and many people read Net News from their Web browser.

The advantages of newsgroups over listservers are seemingly overwhelming. Messages from different groups are kept separate from each other, and all of them are completely separate from your personal email. Subscribing and unsubscribing and posting to and from groups typically involves a keystroke rather than the carefully composed email message sent to an unusual address required by a listserver. Within a group, it is possible to read the messages by topic rather than in the order they are posted. If a topic is uninteresting to you, it is possible to delete all the messages on that topic, and this can even be set up to occur automatically. Finally, as a "moral" issue, when the number of readers becomes large, newsgroups consume fewer system resources worldwide than do listservers. The reason that listservers still exist, however, is that a proliferation of the number of newsgroups causes a variety of problems, including consumption of world wide system resources, and, partially as a result, it takes significant effort and interest within the internet community to create a new group. Thus, listservers are used to create small and/or temporary and/or casual groups whereas newsgroups are set up to allow conversation on more general issues of widespread interest. As a practical matter, listservers are probably as common today (1999) as they were in 1996 when this article was first written.
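Reading by topic and discarding uninteresting topics can be sketched in Python as follows. This is a deliberate simplification: I group messages by their subject line, ignoring a leading "Re: ", which is my own shortcut for the example and not a description of how any particular newsreader decides which messages belong together. The function names and sample messages are invented.

```python
from collections import defaultdict

def group_by_topic(messages):
    """Group (subject, text) pairs by subject, ignoring a leading 'Re: '."""
    threads = defaultdict(list)
    for subject, text in messages:
        topic = subject[4:] if subject.lower().startswith("re: ") else subject
        threads[topic].append(text)
    return dict(threads)

def kill_topics(threads, unwanted):
    """Drop every message whose topic is on the reader's list of unwanted topics."""
    return {topic: texts for topic, texts in threads.items()
            if topic not in unwanted}

# Messages in the order they were posted.
posted = [
    ("PCR primer design", "Any software recommendations?"),
    ("Make money fast", "Send five dollars to..."),
    ("Re: PCR primer design", "I have had good luck with..."),
    ("Re: Make money fast", "Please stop posting this."),
]

threads = group_by_topic(posted)
print(kill_topics(threads, {"Make money fast"}))
```

The reader sees the question and its answer together, and never sees the other topic at all.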

The classic collection of newsgroups is Usenet. Usenet consists of seven groups (and thus is also known as the "big seven"); sci (science), comp (computers), soc (social or sociology - I am not sure), talk, misc, news, and rec (recreational). It is important to remember, however, that not all newsgroups are part of Usenet. Newsgroups which are not part of Usenet include Bionet, Clarinet, biz, alt, bcm, and many, many others. To tell you the truth, I have lost track of how many Net News groups there are. The number has increased so rapidly that Net News software has had to be rewritten to handle the large number of groups (a situation somewhat analogous to the Y2K problem). The vast majority of this increase has occurred in non-Usenet groups, such that Usenet is now only a small minority of Net News groups.

Most users will not notice the difference between Usenet and non-Usenet groups that are received by their site. However, not all newsgroups are transmitted to all sites. bcm, for example, is a group set up by and for Baylor College of Medicine and is only received within Baylor. In fact, what characterizes Usenet is the rules used therein for group creation. Thus, Usenet is an assurance of general interest (if not quality) which is intended to encourage more system administrators to carry these groups.

Newsgroups use, by convention, a hierarchical naming scheme. Consider two examples:

sci.bio.microbiology
bionet.microbiology

sci.bio.microbiology is the microbiology subgroup of the bio(logy) subgroup of the sci(ence) group of Usenet. Bionet.microbiology is the microbiology subgroup of bionet. Bionet is not part of Usenet. Bionet (like the other non-Usenet groups) has its own system for group creation, however, and provides its own assurance of quality. I personally find the Bionet groups to be the highest quality of the newsgroups.
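The naming convention is simple enough to express in two lines of Python. The function name hierarchy is my own; the group names are the two examples from the text.

```python
def hierarchy(group_name):
    """Return the chain of enclosing groups, from most general to most specific."""
    parts = group_name.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

print(hierarchy("sci.bio.microbiology"))   # ['sci', 'sci.bio', 'sci.bio.microbiology']
print(hierarchy("bionet.microbiology"))    # ['bionet', 'bionet.microbiology']
```

The first element of each list is the top-level group, which is what tells you whether a newsgroup belongs to Usenet (sci) or not (bionet).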

From a practical point of view, I suggest that newsgroup newcomers of the biological persuasion look over the list of bionet groups and the sci.bio subgroups and subscribe to those that seem interesting. Follow them for a while, and unsubscribe from less interesting groups until a balance between the time required to follow the groups and the value of the information retrieved is reached. In addition, for biologists interested in computing, subscription to a very limited and specific subset of groups in the comp group of Usenet can be invaluable.

Information can be obtained from newsgroups both by "lurking" (reading the group without posting) and by asking specific questions and waiting for the replies. Good citizenship requires, however, that in addition to posting questions one also answers them when appropriate, though too many answers are more often a problem than too few.

As a final warning, the two biggest problems with newsgroups are a very low signal to noise ratio and an exceedingly low level of common courtesy. (Both of these problems are less on Bionet, in my opinion.) If you post to newsgroups, expect to be gratuitously insulted ("flamed") to an extent you may have never before experienced. Also remember that more than one career has been destroyed by the black hole time sink of Usenet.

MOOs

MOO stands for MUD, Object Oriented, where MUD stands for Multi-User Dungeon. Dungeon is one of the first computer games, a text-based game loosely derived from the (non-computer) game Dungeons and Dragons. In Dungeon, one types a series of commands into the computer to cause an imaginary self to maneuver through an imaginary environment to try to avoid being killed by imaginary monsters, to solve imaginary puzzles, and to accumulate imaginary treasures, none of this being multimedia, all being described only in words. In a MUD, many players participate in the same imaginary environment so that they can interact with each other as well as with the computer, adding to the gaming environment. Besides MOO, there are many forms of MUD. They are, in general, systems for creating these games and allowing multiple people to play them over the Internet. MOO, however, was created by Xerox because it was felt that text-based virtual reality could be used for serious conferencing as well as for gaming. (The Object Oriented in the name describes the built-in programming language for creating and modifying the virtual reality.)

I assume most of you are more or less familiar with the concept, given that this course is conducted on a MOO, and in any case, description is a much less efficient way of conveying what this is all about than participation. The links below connect to two (non-gaming, more or less serious) MOOs of particular interest to biologists. Use them and log on as "guest" to explore these environments.

MOO       Connect via Telnet             Read About
BioMOO    bioinfo.weizmann.ac.il 8888    Web Home Page
DU MOO    moo.du.org 8888                Web Home Page

Note: depending on your client and platform, clicking on the Connect via Telnet link may or may not open a telnet connection. If not, use the address and port information manually. Also, you can connect to these two MOOs over the web. Details are on their home pages.

WAIS, Gopher, and the Web

Although a few gopher sites still remain, although some have integrated WAIS sites, and although there might still exist some stand-alone WAIS sites, for all practical purposes these protocols have been completely superseded by the Web, described in the next section.




Using The Web

Words and Pictures. Hyperlinks between and within documents. Movies and Music. Online shopping, database access, and even basic attempts at virtual reality. Is there nothing the Web cannot do?

The Web describes information using HyperText Markup Language (HTML) and transmits it using HyperText Transfer Protocol (HTTP). The current common name, The Web, is a contraction of its original name, the World Wide Web, also abbreviated as WWW or W3. A Web browser performs multiple tasks. First, any Web browser is an HTTP client; it knows how to transfer data using HTTP. Second, any Web browser also knows how to interpret and display HTML, the content markup language used on the Web. Different browsers have different display capabilities and display the same HTML code in different ways (which is why HTML is referred to as a content markup language instead of a page description language), but all of them can understand (parse) HTML and do something reasonable with it.
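To make the idea of parsing content markup concrete, here is a minimal sketch in Python using the standard library's html.parser module. It reads a small HTML document and reports which text is marked as a top-level heading and which link targets it contains, without deciding anything about fonts or layout. The sample page and the names OutlineParser and outline are made up for illustration; this is not how any particular browser is implemented.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect <h1> heading text and <a href=...> targets from HTML."""

    def __init__(self):
        super().__init__()
        self.headings = []    # text found inside <h1> elements
        self.links = []       # href values of <a> elements
        self._in_h1 = False   # True while we are inside an <h1> element

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and data.strip():
            self.headings.append(data.strip())


# Hypothetical page, written in the upper-case tag style common in 1999.
SAMPLE_PAGE = """<HTML><BODY>
<H1>Using The Web</H1>
<P>Continue to the <A HREF="next.html">next section</A>.</P>
</BODY></HTML>"""


def outline(html_text):
    """Return (headings, links) found in an HTML string."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.headings, parser.links
```

Calling outline(SAMPLE_PAGE) yields the heading text and the link target; how a heading is finally drawn on screen is left to whatever program consumes that information, which is the sense in which HTML marks up content rather than describing pages.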

Some of the differences in the way different Web browsers display the same Web page come from different design decisions ("what font should be used for <H1> text?") and some come from the fact that different Web clients have different capabilities. Some of these differences, such as the ability to display various kinds of still or moving images as part of the Web page or to run programs written in Java, ActiveX, or Javascript, represent extensions to HTML. These extra capabilities may be built into the browser or may be added by "plugins", software extensions which give the browser new functionality. Finally, the behavior of a Web browser can frequently be changed by configuring its preferences; if you find the default font too small, it can often be enlarged.

Many new computer users assume that the Web and the Internet are synonymous. However, many protocols other than HTTP flow over the Internet. In part, the new user is confused by the fact that, in addition to supporting extensions to HTML, many popular web browsers have support for other protocols such as email (SMTP, POP, IMAP), newsgroups, ftp, and gopher for example. What this really means is that the particular piece of software (e.g. Netscape Communicator) is more than just a Web client, it is also an email client, an FTP client and a Gopher client. Finally, HTTP does not have to be transmitted over the Internet, and HTML doesn't have to be transmitted via HTTP. Web technology has become a common interface tool for communication between computers on a local network (sometimes called an Intranet), and every Web client I have worked with has the ability to read and display local HTML files.

Because virtually every Web client is also a limited FTP client, many people choose to use them that way. In the case where a Web page contains a link to an FTP server, simply selecting the link downloads the file. If, however, you are given the following instructions to retrieve a file:

"The file is available by anonymous ftp.
 ftp to ftp.bcm.tmc.edu
 and retrieve mbcr/pub/file.txt"

...you could accomplish this with your Web client by pointing it to this URL:

ftp://ftp.bcm.tmc.edu/mbcr/pub/file.txt
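The translation from anonymous ftp instructions to a URL is mechanical: the scheme is ftp, the host name follows the double slash, and the path on the server comes last. A minimal Python sketch, assuming only the standard urllib.parse module (the function name anonymous_ftp_url is hypothetical):

```python
from urllib.parse import quote, urlunsplit


def anonymous_ftp_url(host, path):
    """Build an ftp:// URL from a host name and a file path on that host."""
    # Anonymous ftp needs no user name or password in the URL.
    # quote() escapes characters that are not allowed in a URL path
    # but leaves the "/" separators alone.
    clean_path = "/" + quote(path.lstrip("/"))
    return urlunsplit(("ftp", host, clean_path, "", ""))


url = anonymous_ftp_url("ftp.bcm.tmc.edu", "mbcr/pub/file.txt")
```

With the host and path from the instructions above, this produces the same URL shown in the text.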

Problems On The Web

One of the major problems currently afflicting the Web is that of incompatibility. This is particularly unfortunate and ironic in that one of the goals of the original Web development was seamless integration of information resources across the Internet. One common symptom of this problem is a browser logo and a statement like "Best Viewed with Netscape 4.5". One attempt at solving this problem is the campaign for platform independent Websites.

Another recurring problem is that of security. My major motivation for updating this page from Version 2.01 to 2.5 was that during the two and a half year interim many links had decayed and much of the text had become dated. One of the most obvious offenders was a note about a security hole in Netscape version 2.0 with a link to a description of that hole. I fully expected that link to have decayed to nothing. In fact, I found that the page had moved, but astonishingly, at the top of the new page was a report, dated April 1999, of a serious security hole affecting Netscape 4.5! It appears that the authors of this page have made a long term commitment to monitoring security problems on the Web, and I suggest you monitor this site regularly.

Both the 1996 and the 1999 security problems were the result of the ability of Web browsers to execute the Java programming language. As it happens, the advice I gave in 1996 is still relevant:

  1. Keep in touch with the author of your browser and upgrade to new versions as seems warranted.
  2. Consider turning off support for Java in your browser.
  3. Above all, be cautious, alert, and informed.

Some Practical Considerations for Using The Web

  1. It has become increasingly possible to spend money on the Web. Nonetheless, with one exception, what I said in 1996 is still true: it is unlikely that you will spend money by accident. The exception is an important and unfortunate one; if you work for a for-profit institution (company) located outside of Switzerland, according to Geneva Bioinformatics, the owners of SwissProt, you must pay a substantial licensing fee (thousands of dollars a year) for any use of this database. For example, while using the otherwise free service Entrez you might innocently link to and download a SwissProt sequence file. According to Geneva Bioinformatics, you have just become liable for an annual licensing fee, and there is no way you could have known that. Thus, if you are working for a for-profit organization, you should either carefully avoid SwissProt, or else pay the licensing fee. Other than that, you will find that many resources on the Web are free, and those that are not, inform you clearly and carefully before obligating you to pay. (Access to SwissProt is not required for this course.)

  2. In 1996, I wrote: "Although the Internet is becoming increasingly important to biologists, it is still not a sufficient resource for keeping up with biology...With access to a good (or even adequate) science library, you could do without the Internet."
    This is changing. Reasonable people might now differ as to whether the library or the Web is more important, but most would probably agree that both are now virtually essential for the working biologist. To be clear, physical access to a library is becoming less essential, but belonging to an organization affiliated with a library remains important. Most research is still published in conventional journals. What has changed is that complete copies (full text and figures) of more and more journals have become available via the Web. However, most of these are only available to subscribers. Most working biologists cannot afford personal subscriptions to all of the journals they need to read. However, if your library subscribes to a journal, that usually entitles you to view, download, and print articles from the journal's website.
    I'm certainly not suggesting that libraries and librarians are unimportant. Information Science is an increasingly important discipline, but I think most librarians and users of libraries can agree that the role of libraries and librarians in science is changing rapidly and it is difficult to predict where it is going.

  3. Network resources are not as reliable as one would like. If you select a resource and receive only an error message, the fault is likely to be with the server. It is even the case that if you reach the server and don't receive the results you expect (e.g. search for a common term and get nothing in return) this might be due to the fact that the server is misbehaving rather than you doing the search incorrectly. And finally, what has become an increasing problem is that webpages will either move, be edited to significantly change their meaning (e.g. this page), or become temporarily or permanently unavailable.

  4. Compared to the well developed conventions for referencing data from the paper literature, conventions for acknowledging Internet resources in a thesis or paper are in utter chaos. Surfing the net for a few months will uncover a number of competing standards for how to accomplish this. For resources available on the web, a URL is my reference of choice. However, given the previous point, I have become cautious about publishing references to digital resources at all.

  5. Just because it is on a computer doesn't mean it is correct. This sounds like a banal truism, but it is startling how good the beauty of computer output can make bad data look. Genbank is loaded with author errors (both in the sequence and comments sections) and probably contains some archivist-introduced errors as well. The same is true of Medline. Any study which assumes perfection in such data will produce an incorrect result.

Finding Web Resources

The Web is vast and disorganized, and the overwhelming majority of what is there is irrelevant to you. Further, the Web changes constantly; new resources appear, old resources become outdated or disappear and the paths and techniques used to access resources change. The bad news is that there is no perfect way of finding the resources that might be useful to you. The good news is that you don't need a perfect way; finding even half the useful resources on the Web is well worth while, and a lot better than ignoring the Web altogether.

I find the following approaches the most useful for identifying biological resources on the web:

  1. Search Engines: There are several websites from which you can launch searches. You type in words that you expect to find on the Web page you are looking for, and the service returns a list of pages on the Web that contain those words. The services that I keep links to are:
    Google
    One of the newest search services; still listed as Beta. Provides a much shorter, more focussed list of sites than most other services. This is currently the first search I try.
    Alta Vista
    This used to be my first choice. Very complete, very powerful search facilities. I switched to Google because Alta Vista frequently generates an unusably large list of sites.
    Others
    Different search services give different results. If neither of the above two search services finds what I am looking for, I try one of the following:
  2. Natural Links: The original dream of hypertext was that natural relationships between topics would be linked. Within the realm of biology, this dream is increasingly coming true. Although such links are not nearly as complete, or as easy to use as one might like, they are improving and already are extremely useful. If one accesses Medline via Entrez, for example, one can (sometimes) link from the bibliographic citation retrieved from Medline to a full text copy of the article on the publisher's website. Similarly, one can link from OMIM to the HUGO gene nomenclature site to GeneCards to my Tumor Gene Database. Such links are not always labeled in the most obvious ways or located on a page or within a site exactly where you would expect them so, for the present, it is important to explore a site creatively in order to find all the valuable links that might be there.
  3. Starting Points: Before one can find and use natural links, one must have a place to start. Pointers to useful resources can come from friends, printed journals, usenet, or from websites that collect such pointers. Paula Burch and the rest of my friends at the Molecular Biology Computational Resource at Baylor College of Medicine maintain a few such lists. Although comprehensive lists of important biological resources used to be valuable, it is my impression that these lists are no longer being maintained. Presumably, the explosive growth of the Web in general and biological resources specifically, combined with the increased quality of the search services described above, killed these lists. I provide the following links largely for historical reasons; I expect them to be gone in Version 3.0 of this chapter.
    1. EXPASY.
    2. Pedro's Biological Research Tools.
    3. USGS NETWORK RESOURCES: BIOLOGY SERVERS.
    4. The World Wide Web Virtual Library: Biosciences
  4. Your own personal bookmark file: Once you find a resource you like, it is important to make a note of it as finding it again can be frustratingly difficult. Thus, it is important to aggressively and actively maintain a "bookmark list". Most Web clients make it relatively easy to do so.

Useful Web Resources

There are a number of resources on the Web that you will use during this course. The ones specific to particular chapters are listed in the next section. Note that these have not been updated since 1996 and thus might contain obsolete links. When the coursebook is comprehensively updated, these references will be moved to the relevant chapters. Listed here are two resources that you will use throughout the course and some other resources that I think you will find generally useful as well.

The BCM Search Launcher

The BCM Search Launcher allows you to launch a number of different kinds of database searches and other sequence analyses using one common front end. To deal with the vast number of options that result, a rather cryptic but powerful series of pages was designed.

For work done in this course, you will select one of the following options from the BCM Search Launcher home page:

The pages you go to, in each case, will all be similar in structure. Near the top of the page will be a text box into which you paste (or type) the sequence(s) to be analyzed. In the middle of the page are the buttons you use to launch (or clear) the search, and on the remainder of the page is where you describe what kind of search you want to do, implemented as radio buttons. Each radio button represents, in general, a server (although two buttons might represent the same server with different options). For each server, there are a few words of explanation, including where the server is located, and most importantly, three links:

  1. One labeled [H] which links you to a help file for the server.
  2. One labeled [P] which tells you the parameters the search will use.
  3. One labeled [O] which takes you to an alternative page from which to launch the search which allows you to set options to values different from the default.

Entrez

Entrez provides access to Pubmed, Genbank, a comprehensive protein sequence database, a protein 3D structure database, and other related data. Pubmed includes both Medline and new publications which have not yet been indexed for Medline. The protein database contains sequences from Swiss-Prot, PIR, PRF, PDB, and translated protein sequences from the DNA sequence databases. In this course, you will primarily use Entrez as a source of protein and nucleic acid sequences. I will discuss this use first, and the use of Entrez to retrieve literature citations second.

When you use Entrez to search the sequence databases, what you search is the comment fields of the sequences, not the sequences themselves. To search by sequence, use The BCM Search Launcher or another similar tool.

Entrez has both basic and advanced search interfaces. The basic search interface is unfortunately well hidden and is mis-named. You, of course, can find this interface instantly using the link above, but if I had not given that to you, you would get to this page by first selecting Literature-Pubmed (which is the page for advanced searches of Pubmed) and then, from that page, selecting the link in the LEFT column named Basic Search. This simple search page is titled Pubmed, but in fact it allows simpler searches of all the databases. From this page, you select only the database to search (e.g. Nucleic Acid, Protein, Literature) and words to search for. You are then taken directly to the list of records in that database which contain the words you requested. Simply click on the link for the format in which you wish to see the sequence. In addition to various formats for the sequence, you also have the option of viewing related records from the same or different databases. These options will be explained below.

The Entrez home page has links in two columns; a skinny blue column to the left and a wide white column to the right. The links to the left are help files and links to other resources. On the right are the links to the various databases. If you select one of these links, you can do advanced searches of the databases. All the databases are searched in basically the same way; you build searches progressively. You perform an initial search and then either decrease (or increase) the number of hits retrieved by adding search terms which are ANDed (or ORed) to what you have.

In the initial Entrez search, you specify which "field" of the database you wish to search (default: all of them), and whether you want an automatic search (the default) or want to list terms. The search fields (e.g. Author Name, Accession Number) should be fairly obvious, and in any case the label Search Field is a link to a help file. Note, however, that as of this writing, some of the choices offered in the popup are not documented in the help file and seem not to work. The choice between Automatic and List Terms is more cryptic. Usually the default of Automatic is what you want; Entrez automatically translates the words you type in into terms in its database. However, if you list terms, you are presented with a list of the matching terms in the Entrez database (and their alphabetical neighbors) and can select among them accordingly.

The result of this initial search is a page for refining that search. You can continue to enter terms, and by default they are logically OR'ed one with another to increase the number of records detected. However, at the bottom of the page, all the terms you have entered are listed and you can combine them in various logical ways, replacing the default OR of the terms with what you select. Finally, if you are a Boolean wizard, in the upper left corner of the page is a button labelled Details. Clicking that opens a window containing your search as text which you can hand modify any way you like. Once you are happy with your search, you click the Retrieve ## Documents button, and get a list of records as for the basic search.
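The effect of combining terms with AND and OR can be pictured with sets of record identifiers: AND keeps only the records that every term retrieves, while OR keeps the records that any term retrieves. The Python sketch below uses a tiny made-up index (the terms and record numbers are hypothetical) and is only meant to illustrate the logic, not how Entrez is implemented.

```python
def search(index, term):
    """Return the set of record IDs indexed under a term (empty if none)."""
    return set(index.get(term.lower(), ()))


# Hypothetical index: search term -> IDs of the records containing it.
INDEX = {
    "kinase": {1, 2, 3, 5},
    "human": {2, 3, 4},
    "mouse": {5, 6},
}

kinase = search(INDEX, "kinase")
human = search(INDEX, "human")
mouse = search(INDEX, "mouse")

narrowed = kinase & human            # kinase AND human: fewer hits
broadened = human | mouse            # human OR mouse: more hits
combined = kinase & (human | mouse)  # kinase AND (human OR mouse)
```

Here narrowed holds records 2 and 3, broadened holds records 2 through 6, and combined holds records 2, 3, and 5. The parentheses in the last line matter, in the same way that grouping matters when you edit the text of a search by hand.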

Searching for a sequence for which you have an accession number is one of the most frequent tasks you will have because journals now require that papers accepted for publication contain an accession number for one of the major databases. Given that you have read a paper and obtained such an accession number, the strategy is:

  1. Select the protein or nucleotide database.
  2. Set Search Field: to Accession.
  3. Set Search Mode to Automatic.
  4. Click the Search button.

This should retrieve one sequence, so on the next page you should immediately click the Retrieve 1 Document button, and then from the list of 1 sequence on the next page, click on the link for the format in which you would like to see the sequence.

Other Web Resources

NCBI (the National Center for Biotechnology Information) is the Microsoft of bioinformatics. It is part of the U.S. Government's National Library of Medicine (NLM), which is one of the National Institutes of Health (NIH). They are doing a lot of work linking existing resources (Medline, Genbank, protein databases, OMIM) and creating some of their own, e.g. Unigene, a resource which gathers together the many sequences, especially EST sequences, which correspond to a particular gene. This is the best way to gather all the RNA sequences available for a gene in which you are interested. Unfortunately, Unigene does not currently cover genomic sequences. Overall, NCBI's home page is a valuable resource.

Medline is an on line database of "medical" references. Because "medical" is interpreted extremely broadly (all articles in the journal "Cell" are in there) it is of value to virtually all biologists. It contains a complete literature citation, complete abstract in most cases, and a variety of other useful kinds of information (institution of first author, grant support for the research described and key words). This database is indexed and searchable. Searches normally occur in real time and take seconds. Papers back to 1966 are included, although the earliest references lack abstracts.

Increasing the value of Medline is the fact that it is constructed by professional abstractors who assign key words from a controlled and highly structured vocabulary. Thus, if you can define key words that describe what you are interested in, you can be sure that you will not miss articles because the author uses terminology different from what you use in your search.

I know of two free versions of Medline, one at the National Library of Medicine and one that is part of Entrez. Entrez, described above as a sequence retrieval tool, has a number of advantages and is the one I will discuss.

The Entrez interface to Medline allows searches of words in the various fields of Medline (title, author name, etc.) similar to those described above for sequences, and allows searches for MeSH terms like any form of Medline. In addition, it has tools for helping you search the rather arcane MeSH hierarchy and for performing the common task of retrieving a paper for which you have the standard literature citation (journal, volume, page). The Entrez interface to Medline also has the added benefit of extensive links to other resources. For many literature citations, a link to the full text of the paper on the publisher's website is provided. (You or your library usually needs a subscription to the electronic version of the journal to use this link, however.) Finally, the Medline citation is linked to any nucleotide or peptide sequences reported in the paper. Similarly, sequence records in Entrez are linked back to Medline, and links to other relevant databases are being added all the time.

OMIM stands for Online Mendelian Inheritance in Man. Mendelian Inheritance in Man has as its organizing concept human genetics, especially human genetic diseases, but its author, Victor McKusick, takes such a broad view of this topic that this resource is almost a global review of (human) biology. Everyone who cares about biology should have a link to OMIM on their Hotlist/Bookmark List!




Use of World Wide Web Resources Required in this Course


Warning! Material hereafter has not been updated for Version 2.5. Thus, some links in this section may be broken. Because this material relates to other chapters of the coursebook, it will not be updated until the entire book is revised.

Resources Required for Chapter 1: Pairwise Sequence Alignments

Although the BCM Search Launcher includes the various kinds of BLAST and FASTA searches, for Chapter 1 of the course you will be adjusting parameters of these searches which are not normally adjusted, and which are thus not adjustable from the BCM Search Launcher. As a result, you will use additional resources to do the BLAST and FASTA searches required for Chapter 1.

The BLAST server is implemented as an HTML form you fill out. As such, it is largely self-explanatory. I do, however, note the following points which might be helpful.

  1. Not all options apply for all searches; choosing the program BLASTN, for example, makes choice of a matrix irrelevant.
  2. Depending on your browser, some of the data entry fields work in a confusing way. For example, the field in which you specify the maximum number of sequences to return contains a default of 250. When you select it, the field may display 0 even though the 250 is still there. If this happens, you have to remove it by backspacing to enter the desired value.
  3. In order to specify values for S and W you click on the Additional Options: YES radio button. Having done that, you need to specify which additional options you want in the next field as if you were supplying them on the command line to the UNIX-based BLAST program. To find out how to do that, click on the phrase "Additional Options" (which is a link) which will take you to a page explaining that.
  4. One example: For a nucleic acid sequence, the default value for W (the word size) is 12. The value for S (the cutoff score) is normally calculated from E (the number of chance matches expected from the search) and the other characteristics of the search. Normally one does not change W at all, and changes S only indirectly by changing E, but for educational purposes you will set both directly here. To accomplish this, you might type into the field:

    s=50 w=10

    Values of W over 12 (the default for nucleic acids) are not allowed. Further, although the search is performed with non-default values of W, a warning message is generated.
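If you plan to run several searches with different settings, it can help to generate the option string rather than retype it. The following Python sketch builds a string in the same form as the example above and refuses values of W above 12, the limit stated in the text. The function name and the constant are hypothetical; only the "s=... w=..." form and the limit of 12 come from this chapter.

```python
MAX_W_NUCLEIC = 12  # default and upper limit for W on nucleic acids, per the text


def blast_additional_options(s=None, w=None):
    """Format S and W as a command-line style option string."""
    parts = []
    if s is not None:
        parts.append(f"s={s}")
    if w is not None:
        if w > MAX_W_NUCLEIC:
            raise ValueError(
                f"W may not exceed {MAX_W_NUCLEIC} for nucleic acid searches"
            )
        parts.append(f"w={w}")
    return " ".join(parts)


options = blast_additional_options(s=50, w=10)
```

The value of options is then the string to paste into the Additional Options field.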

The Fasta server is suggested here for a number of reasons:

  1. It allows you to set ktup.
  2. It is the only server I am aware of that offers the latest version of FASTA, version 2.0. Version 2.0 offers a number of advantages over previous versions, including being more sensitive and providing a statistical estimate of the probability that each match might have occurred by chance.
  3. Unlike the GenQuest server (which is the server accessed by the BCM Search Launcher), it allows you to do FASTA searches of DNA sequences. (The GenQuest server only allows FASTA searches of peptide sequences.)

Use of this server is largely self-explanatory. From the home page, linked above, scroll down the page and pick the (FASTA) search you wish to perform. There are different links for DNA and protein searches, for example.

Unfortunately, the above FASTA server was unreliable at the time of this writing. Thus, this Fasta server is given as a backup. It is not FASTA 2.0 and is rather bare-bones compared to many of the servers on the web. For this server, you initiate your search on the web and the result is emailed to you. The form is mostly self-explanatory, with the exception of one part I found/find confusing; the choice of a library to search. At this point you pick one from a list of cryptic 4 or 5 letter codes from a pop-up. I have been unable to find help for this, but after some thinking and experimenting came to the conclusion that they represented EMBL, new (recent) entries (EMNEW), EMBL, all entries (EMALL), the equivalent for Genbank (GBNEW, GBALL) and EMBL divided up by organism: EVRL = Viral sequences in EMBL, etc.

You will be doing the pairwise alignments for Chapter 1 using the alignment tool built into BioMOO. The reason for this is that we were unable to locate a net-based alignment tool which had the characteristics required for this chapter. You might, however, want to explore the following Pairwise Alignment server.

Resources Required for Chapter 3: Multiple Alignment

SRSWWW is a complex server which has a number of properties which I find confusing. First, there are a lot of apparently similar options which do rather different things. For example, on the home page there is a conventional link named Databanks, and below that a series of buttons labeled "Search sequence libraries", "Search libraries with protein structure information", "Search a library linked to sequence libraries", etc. If you know you want to search PDB, the database of protein structures maintained at Brookhaven National Laboratory, which of these should you choose? In general, it depends. For this chapter of the course, the answer is the link to Databanks.

The advantages of SRSWWW are that it is more comprehensive than other servers of this kind, and that it contains relationships between data that are not easily accessible from the original data and which, among other things, allow you to link between databases. Learning how to use the full capabilities of this server is well beyond the scope of this course, but fortunately, there are only two features of SRSWWW which you will need for this course: PDBFINDER and ALI (a.k.a. 3dALI). ALI is used only for a few optional exercises and thus will not be treated in detail. To access it, select the:

       Network Browser for Databanks in Molecular Biology

page. This gives you a list of databases, from which you select ALI. Note that for the ALI database, only searches by ID work. An ALI ID is a cryptic letter code. The easiest way to find a code of interest (if someone has not given it to you) is to do a wildcard search (which you get if you search without entering a search term) and browse the list of all IDs returned.

The other database you will use in this chapter is a reformatted PDB called PDBFINDER. PDB is a database of protein 3D structures. Each PDB entry contains a list of amino acids with their 3D coordinates which can be converted into a protein structure. Our use of PDB in this course is surprising in that we make minimal use of this structural information. Rather, we mostly use the amino acid sequence stripped of its 3D information. The reason for this is that you will be repeating a published alignment, and the structural information was used in the publication. Thus, the authors referred to the sequences they used by their PDB filenames, and in order to retrieve these sequences, you must do so from PDB.

PDB is not an ideal database from which to retrieve sequences. In the first place, the sequence is embedded within structural information and it would take a fair bit of work with a text editor (or a program) to cleanly excise the sequence. Further, PDB uses three letter amino acid codes which would have to be converted to 1 letter codes before the sequence could be used by most programs. Use of SRSWWW/PDBFINDER solves these problems for you: when you retrieve a file from PDBFINDER, the peptide sequence is presented as one (long) line of 1 letter amino acid codes, trivial to excise. In addition, in many cases SRSWWW is able to link the PDBFINDER file to the equivalent Swiss-Prot file.
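For readers curious what the "program" alternative would involve, here is a minimal Python sketch of the conversion. It assumes the standard fixed-column layout of SEQRES records in a PDB file (chain identifier in column 12, residue names starting at column 20) and handles only the twenty standard amino acids, writing X for anything else. The function name and the example lines are made up for illustration; PDBFINDER remains the easier route for this course.

```python
# Three letter to one letter codes for the twenty standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequence(pdb_lines, chain="A"):
    """Extract one chain's sequence from SEQRES records as 1 letter codes."""
    residues = []
    for line in pdb_lines:
        # Assumption: chain ID sits at index 11, residue names start at index 19.
        if line.startswith("SEQRES") and line[11:12] == chain:
            residues.extend(line[19:].split())
    # Unknown or modified residues are written as X rather than guessed.
    return "".join(THREE_TO_ONE.get(res, "X") for res in residues)


# Hypothetical excerpt of a PDB file: two chains plus a coordinate line.
EXAMPLE_LINES = [
    "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU",
    "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN",
    "SEQRES   1 B    3  PHE VAL ASN",
    "ATOM      1  N   GLY A   1",
]

chain_a = seqres_to_sequence(EXAMPLE_LINES, chain="A")
```

The coordinate (ATOM) line is ignored, the two SEQRES lines for chain A are joined, and chain B is left out unless it is asked for explicitly.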

In summary, to retrieve a sequence from a PDB file requires the following steps:

       Network Browser for Databanks in Molecular Biology

When you perform the above steps the first time you are requested to do so in this chapter, your search will return no results. This is due to another property of PDB which makes it difficult to use, that PDB ID numbers are not guaranteed to persist. It is characteristic of database entries that they can become superseded by newer entries. How to handle this is a major philosophical issue in database design. In Genbank, the ACCESSION field contains the accession number of the current file and following that the accession numbers of all sequences it supersedes. Thus, if you search Genbank with a superseded ID number, you will retrieve the more recent record. This is not the case for PDB. The older ID is contained in the record, but only in a superseded field, which can only be searched using a free text search. Unfortunately, SRSWWW cannot do full text searches of PDB. Thus, another resource you will need in this chapter is the PDB Gopher. In general, if you search for an ID and do not find it, do a free text search for it on this gopher. Use of this gopher is self-explanatory.
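The Genbank approach described above amounts to indexing every record under all of its accession numbers, current and superseded alike, so that a search with an old number still lands on the current record. A minimal Python sketch of that idea follows; the records and accession numbers are invented for illustration, and this is not a description of how either database is actually built.

```python
# Hypothetical records. By the convention described in the text, the first
# accession in each list is current and the rest are superseded.
RECORDS = [
    {"accessions": ["X00003", "X00001", "X00002"], "title": "merged record"},
    {"accessions": ["X00004"], "title": "independent record"},
]


def build_accession_index(records):
    """Map every accession, current or superseded, to its record."""
    index = {}
    for record in records:
        for accession in record["accessions"]:
            index[accession] = record
    return index


def current_accession(index, accession):
    """Return the current accession for any known accession, else None."""
    record = index.get(accession)
    return None if record is None else record["accessions"][0]


ACCESSION_INDEX = build_accession_index(RECORDS)
```

Looking up the superseded X00001 returns X00003, the current record. A database that indexed only the first accession of each record would return nothing for X00001, which is the situation the text describes for PDB.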

Although Entrez contains a copy of the Swiss-Prot database, the copy of the Swiss-Prot database available from the SWISS-PROT Server itself contains sequences that are absent from the Entrez copy. These sequences are expired (that is, they have been replaced by updated versions of these sequences) and thus have been removed from the Entrez version of Swiss-Prot. Since these expired sequences are required for this chapter, you will use the SWISS-PROT server directly to obtain them. Help is available for the SWISS-PROT server.

The following will also be used, but are only described briefly at this point. They will be considered in more detail in the chapter itself.

  1. MSA is a good tool for aligning a small number (e.g. 5) of sequences in order to gain insight into their relationships. This server uses the algorithm described in Altschul, Lipman, Kececioglu (1989).
  2. The BCM Search Launcher (which we have covered earlier), especially the Clustal facility of that server.
  3. The MaxHom Alignment server[5], which is offered together with structure prediction. This server also has an information facility available.
  4. The All-All related peptide sequences server[5], discussed below.
  5. The AMAS server [5] which you can use to Analyse Multiply Aligned Sequences.
  6. WebLogo Sequence Logo Generation [5], a tool for the Analysis of Multiple Alignments. It requires its input in FASTA format.
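Several of the servers listed above expect sequences in FASTA format: a header line beginning with ">" followed by an identifier, then the sequence itself on one or more lines. A minimal Python sketch for producing such text from sequences you already have is shown below. The function name, the identifiers, and the line width of 60 are illustrative choices, not requirements of any particular server.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Break the sequence into lines of at most `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical sequences, wrapped at 4 characters to keep the example short.
fasta_text = to_fasta([("seq1", "ACGTACGTAC"), ("seq2", "GGCC")], width=4)
```

The resulting text can be pasted into a form field that asks for FASTA input.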

Resources Required for Chapter 4: Mathematical Analysis of Molecular Phylogenetics

The sequences you will need in this Chapter can all be retrieved from Entrez, as is described above.

You will need to do multiple sequence alignments using the CLUSTAL program in the BCM Search Launcher, also described above. Select Multiple Alignments from the first page, and CLUSTAL will be an option on the next page.

The Tree of Life provides a linked phylogeny of species which will be useful for interpreting sequence comparisons. Its use is obvious.

All-All is a server which does "all vs. all" alignments of a list of sequences [6]. By looking at the offered example, you ought to be able to easily figure out how to use this server. Note that all of the sequences are listed one after another in one entry field.
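An "all vs. all" comparison means that every sequence is compared with every other sequence exactly once, so n sequences give n(n-1)/2 pairs. The Python sketch below only enumerates the pairs, using itertools.combinations from the standard library; the sequence names are hypothetical, and nothing here reflects how the All-All server performs its alignments.

```python
from itertools import combinations


def all_vs_all_pairs(names):
    """List every unordered pair of sequence names exactly once."""
    return list(combinations(names, 2))


# Hypothetical sequence names.
pairs = all_vs_all_pairs(["seqA", "seqB", "seqC", "seqD"])
```

Four sequences give six pairs, and ten would give forty-five, which is worth keeping in mind when deciding how many sequences to paste into the entry field.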


VSNS-BCD Copyright
David Steffen
steffen@biomedcomp.com
