3.1 Introduction

This chapter examines one of the most important uses of computers: the storage and retrieval of information. While the general organization of this book is “top-down,” working from end-user applications, down through the various levels of software, towards the physical hardware of the computer, the topic of information storage and retrieval will be approached in the opposite direction – from the physical hardware level to the end-user level.

Our discussion of information storage and retrieval focuses on four distinct “views” of the process, or “levels” at which it takes place: the hardware level, the file organization level, the programmer level, and the end-user level. These four views are illustrated in Figure 3.1

Section 3.2 examines information storage and retrieval from the point of view of what happens at the hardware level. The discussion focuses on disk drives, which are the most popular form of mass storage device.

Section 3.3 introduces the file concept and explores three different alternatives for storing files on disk. These alternative storage techniques are known as: contiguous, linked, and indexed storage.

Section 3.4 looks at information storage and retrieval from the point of view of the application programmer. While details of the software development process, including programming, are deferred until Part III of the text, this section will give you a feel for how programmers manipulate data files.

Finally, Section 3.5 examines how data appears to the end-user. The two most common end-user views of data are discussed: the view of the file system provided by the operating system, and the view of data provided by relational database systems. This second view will be explored in depth, due to the importance of relational database systems as end-user applications.

3.2 The hardware level

A disk drive is an electromechanical device for storing and retrieving data. Disk drives contain one or more disks, which are platters that rotate inside the drive unit. These platters are coated with a special magnetic material that allows information to be recorded on the disk by manipulating the orientation of the magnetic fields on the surface of the platter. Data stored on a disk is said to be in a non-volatile form, meaning it will not be lost when power to the drive is turned off.

Please display the course, quarter, and sequence number for the courses Oneal M.B. has taught.

Figure 3.1: Four views of information storage and retrieval

Almost all disk drives these days are hard disk drives that consist of a sealed unit that contains one or more rigid metallic disk platters that spin at a high rate of speed. You may occasionally run across an older type of disk drive, known as a floppy disk drive.Floppy disk drives were similar to hard drives but stored data on a single removable diskette that was spun at a much lower rate of speed than hard drive platters. A diskette, or “floppy disk” as it was sometimes called, consisted of a single flexible platter (made of thin plastic covered with a magnetic material) encased in a hard plastic shell. A diskette could hold a maximum of 1.44 megabytes of data, or about 1.5 million characters. Floppy disks had their heyday in the 1990’s.

The platters inside a hard disk drive spin at a much higher rate of speed than floppy disks – typically at either 5,400 or 7,200 RPM (Revolutions Per Minute). The result is that hard drives can hold many times the information of floppy drives and access that information much faster. For example, in 2012 an entry-level hard drive had a capacity of about 500 gigabytes, which is about 350,000 times the capacity of a 1.44 megabyte floppy drive, or approximately 500 billion characters.

Regardless of whether a disk is hard or floppy, it will be divided into tracks and sectors. Tracks are concentric circles, kind of like the rings of a tree, onto which data may be written. Each track is divided into a number of equal sized regions known as sectors. Sectors are the smallest addressable unit on a disk drive. Data is both written to and read from disk drives an entire sector at a time.

INSERT FIGURE 3.2

Figure 3.2: The organization of a disk drive

Figure 3.2 contains an illustration of the organization of a disk drive. This particular disk spins in a counter-clockwise direction and has three tracks; the outermost track is labeled A, the middle track is labeled B, and the innermost track is labeled C. Each track contains eight sectors, labeled 1 through 8. Hence, this disk contains a total of 24 sectors.

In order to access information at this level, one needs to know the track and sector number that specifies where the data is to be read from or written to the disk (e.g., Track B, Sector 8). This track and sector information is sent to the disk drive “controller,” which oversees the actual input or output operation.

Real disk drives contain many more tracks per disk platter and many more sectors per track than illustrated in Figure 3.2, but this diagram illustrates the basic organization of a drive. Note that while all of the sectors on a particular track are of the same size, the size of a sector varies from track to track, with the size of the sectors on the innermost track being substantially smaller than the sectors on the outermost track. Regardless of this fact, each sector of a disk holds the same amount of data – to accomplish this feat; data is written more densely on the smaller, inner tracks than on the larger, outer tracks.

In addition to the disk platter itself, drive units contain a mobile access arm. The access arm is a mechanical device that can move back and forth (left and right in the diagram) over the disk. At the end of the access arm is a small electromagnetic device called a read/write head which, as its name implies, is capable of either reading data from, or writing data to, the disk.

Although this set up is the same for both floppy drives and hard drives, one difference between the two involves how the read/write head and disk surface interact. On floppy disk drives, the read/write head was in physical contact with the surface of the disk. On hard disk drives, the head floats on a layer of air just slightly above the disk surface but close enough (less than the diameter of a human hair) to still pick up the magnetic impulses. This requirement for the read/write head of a floppy disk drive to be in physical contact with the disk platter was one of the reasons why floppy disks were slower and much less reliable than hard disk drives.

You may have heard of the phrase “head crash” used to describe a type of problem that can occur with hard disk drives. A head crash occurs when the read/write head of a hard drive actually comes into contact with the disk and thereby damages the surface of the platter. Head crashes are very serious problems, which can destroy a drive and its data. Thankfully, these types of problems rarely occur on modern hard drives.

The process of reading, or writing, a particular sector of a disk consists of moving the access arm so that the read/write head is placed over the track which contains the sector of interest, then waiting for that sector to rotate under the read/write head, and finally either detecting the magnetic fields representing the data (in the case of reading) or recording new magnetic fields (in the case of writing). This process results in three types of delays, or latencies, when reading or writing data to a disk drive:

• seek time – The amount of time required to move the read/write head over to the appropriate track.

• rotational latency – The time required for the start of the desired sector to rotate underneath the read/write head once the head reaches the proper track.

• transfer time – The time required to transfer the data from the disk sector to the computer’s main memory during a read (or from the computer’s main memory to the disk sector during a write). The limiting factor in this delay is the time it takes for the complete sector to rotate past the read/write head.

The actual time to access a particular piece of information will depend on the number of tracks the access arm has to move across and the position of the desired sector relative to the read/write head once the desired track is reached. Average disk access time can be computed as the sum of the averages of each of the three delays.

Avg. Disk Access Time = Avg. Seek Time + Avg. Rotational Latency + Avg. Transfer Time

Of these three delays, average seek time tends to be the largest (about 10 milliseconds) and average rotational latency the next largest (about 5 milliseconds). Transfer time is usually quite small in comparison to the other two. Thus a hard drive on a modern PC can usually perform an I/O operation in under 20 milliseconds (0.02 seconds). Although this may seem very fast to a human, it is quite slow compared to a CPU capable of processing billions of instructions per second. In fact, I/O (Input/Output) is the single largest bottleneck in computing speed.

3.3 The file organization level

Imagine the difficulty of trying to keep up with a large amount of information if you, the end-user, had to remember the track and sector location of every item stored on a disk. For example, if we wanted to keep up with the student records for a university, we would probably have to write down the track and sector of every student’s record. Such a system would be hardly better than keeping the records on paper. Common tasks, such as finding all of the students majoring in Computer Science, would be near impossible, essentially requiring a person to retrieve every student record in the entire system in order to determine who was majoring in what field. Such a system would obviously be unacceptable to most people.

Computer scientists and engineers have solved this problem by constructing levels of software that reside between the end-user and the disk drive controller. These levels of software hide the low-level details of where information is physically located on the disk enabling data to be organized and accessed in more meaningful ways.

INSERT FIGURE 3.3

Figure 3.3: The fields of a student record

Three useful concepts for organizing logically related data are: files, records, and fields. A record is a collection of information about an individual person, object, or thing – for example an employee record for John Dough, or a student record for Suzy Queue. Records are usually composed of a group of fields, where a field is a single item of information, such as a name, age, or address. A file is a collection of logically related records. So, for example, we speak of the student information file, or the employee file.

Being able to refer to data using these logical groupings is much more convenient than being forced to think in terms of tracks and sectors, but how are these concepts actually implemented? Typically, all of the fields that make up a record are placed in a fixed order and stored in physically adjacent locations. In other words, all of the parts of a record, (e.g., name, age, major, etc.) are stored next to one another. Figure 3.3 illustrates one possibility for organizing student records. These records consist of eight fields that occur in the order: name, age, major, ID, sex, address, city, and state. The number of characters (bytes) required to represent each field is given in the figure. The total length of a student record corresponding to this organization will be 83 bytes. An actual record that follows this organization is shown in Figure 3.4. The title of each field, and its position in the 83-character-long record, is indicated. Since “spaces” are real characters that must be stored in the record, they are represented in Figure 3.4 using the • symbol.

INSERT FIGURE 3.4

Figure 3.4: An example of a student record

As previously stated, files are collections of records. Just as the fields of a record are arranged in some order, the individual records that make up a file are also generally arranged in some logical order – such as alphabetically by student name.

Data files come in all sizes, from very large to very small. Furthermore, the size of the records in one file may be quite different from the size of the records in another file. Disk drives, on the other hand, read date from and write data to equal-sized sectors. Given these facts, the question that naturally arises is how can the file concept be mapped to the physical tracks and sectors of disk drives.

To being with, since disk sectors tend to be much larger than individual records, groups of records are joined together into blocks, which are read from, and written to, disk as a unit. In order to minimize the amount of wasted space on a disk, record blocks are constructed to be as close as possible to disk sector size.

For example, if the sectors of a disk were designed to hold 1,024 bytes (characters) and our student file consisted of records that were each 83 bytes long, 12 records could be stored per block. These blocks would then be 996 bytes long (12 X 83), meaning that 28 bytes per sector (1,024 – 996) would be wasted. The amount of wasted space per sector varies depending on record and sector size. In general, the amount of wasted space per sector will be about one half the size of a single record.

So, the records of a file are grouped together into blocks, which are about the size of a sector. If the file is relatively small, all of its records might fit within a single disk sector. It is more often the case that many sectors will be required to hold an entire file. When this happens, we are faced with the problem of deciding which disk sectors should be devoted to holding the file and in what order they should be thought of as occurring.

There are three common approaches to storing multi-sector files on disk drives: contiguous storage, linked storage, and indexed storage.

3.3.1 Contiguous storage

The simplest file organization technique is contiguous storage. Contiguous storage places blocks of records that are logically adjacent to one another into physically adjacent sectors. For example, let’s say that we have a file that consists of 20 student records that we’d like to store in alphabetical order based on last name. The first 12 records could be grouped into one block and the last eight records could be grouped into another block. Contiguous storage would place these two blocks in physically adjacent sectors, such as sectors 2 and 3 of track A.

Figure 3.5 presents a collection of files stored using the contiguous approach, and illustrates their track and sector locations on a diagram of a three track, 24-sector disk.

INSERT FIGURE 3.5

Figure 3.5: A collection of contiguous files stored on a disk

An advantage of contiguous storage is that an entire file can be retrieved rather quickly. Once the access arm has positioned the read/write head over the appropriate track and the beginning of the first sector of the file spins under the head, there will be no additional seek or rotational delays associated with reading the file. In other words, since the entire file lies in contiguous sectors of the same track, once the read/write head is positioned at the beginning of the file, their will be no more need to move the access arm or wait for sectors to rotate under the head.

While speed of retrieval is a significant advantage for contiguous storage, unfortunately it has a severe disadvantage as well. Contiguous storage can be quite wasteful of space. The reason for this is simple. For a file to be stored on disk it is not only necessary that an adequate number of free sectors be available, it is also imperative that the sectors be located next to one another.

In Figure 3.5, ten out of 24 total sectors are unused, yet a three-sector file cannot be stored due to the fact that the necessary number of contiguous sectors is not available. To most people this situation would be unacceptable. How would you feel if you had a disk that was only 58% full, yet refused to store a file that was less than one third the size of “available” space? Rather miffed, I would think.

INSERT FIGURE 3.6

Figure 3.6: A collection of contiguous files after defragging

A disk is said to be fragmented when files are spread throughout the disk, leaving the free space in small scattered “clumps”. One way this fragmentation problem can be overcome is to occasionally defragment, or “defrag”, the disk by moving the location of files so that all of the free sectors can be placed together. Performing a defragmentation operation can be time consuming. Essentially, every file on the disk must be examined and the contents of many files copied to new sectors. Figure 3.6 shows how the disk of Figure 3.5 might look after being defragged. Because the process of defragging a disk can literally take many hours to complete, contiguous storage is generally avoided.

3.3.2 Linked storage

Linked storage overcomes the disadvantage of contiguous storage by doing away with the need to place logically adjacent file blocks in physically adjacent sectors. To accomplish this, Linked storage places two things in each sector: (1) a block of records, and (2) the track and sector number of the sector containing the “next” block of records. Thus a sector not only contains records to be read, but also information on where the next block of records may be found. Because the location of the next sector is stored in the current sector, there is no need for the sectors of a file to be physically located next to each other. Hence, the contents of a file may be spread across many tracks and sectors

The process of reading a linked file involves accessing the file’s first sector by moving the access arm to the proper track and then waiting for the appropriate sector to rotate under the read/write head. That sector is then retrieved yielding a block of records and the track and sector location of the next block. If the next sector is on a different track, the access arm will position the read write head over that track. Regardless of whether or not switching tracks is necessary; the drive unit must wait for the correct sector to rotate beneath the head. The second sector is then read and this process repeats itself until the entire file has been accessed. The last sector of a linked file will have a special tag in its next field indicating that the end of file has been reached.

Figure 3.7 illustrates a collection of six files stored via the linked method. Each file is composed of one or more sectors, where each sector contains a block of one or more records together with the address of the next sector. For example, file Alpha requires two sectors. The first sector, sector two of track C, holds a block of 12 records and the address of the sector storing the next block of the file. The second sector, sector two of track A, holds 8 records. The area that would normally point to the next sector of the file contains the “end of file” indicator.

Notice that the number of records stored per sector varies from file to file. The reason for this is that the sizes of records differ from one file to another while the sector size is fixed for a particular disk drive. If it seems odd that record sizes vary, remember that record size is determined by the amount of data the file’s creator decided to store per item. An employee record is unlikely to be the same size as a student record, which is probably not the same size as a credit card record.

INSERT FIGURE 3.7

Figure 3.7: A collection of linked files stored on a disk

As stated above, the primary advantage of the linked file representation is that it uses space efficiently. Linked storage allows us to make use of all of the sectors of a drive. It would be quite easy to add a three, five, or even 10 sector file to the disk of Figure 3.7 – without the need to perform a costly defragmentation operation. Of course, it is true that there is some overhead associated with devoting a portion of each sector to storing the address of the “next” sector, but this overhead is very small. In fact, for the simple three track, eight sector disk we are using in our illustration, a single byte (character) per sector would be more than sufficient.1

The major disadvantage of linked storage is that the time to retrieve a file can be significantly greater than what is required under contiguous storage. The reason is that seek time (to move the access arm) and rotational latency (to wait for the sector to rotate over to the read/write head) may be experienced for each sector of the file, rather than for just the first sector – as was the case with contiguous storage.

While defragging a disk, in order to make space to store a file, is not necessary under linked storage, the defragmentation operation can still be performed occasionally to speed up file retrieval time. This is because in addition to grouping all of the free sectors of a disk together, defragging linked files involves placing the individual sectors of the files into contiguous locations and updating their “next sector” pointers. After defragging a disk, each of the sectors of a file will point to the very next physical sector.2 In this case, retrieval of a linked file can take place just a quickly as retrieval of a contiguous file. Of course, as files are added, modified, and deleted, the disk will slowly become fragmented again – resulting in increasing file retrieval times – until another defrag operation is performed.

Figure 3.8 shows the contents of a collection of linked files after defragging. This figure is very similar to Figure 3.6, which illustrated defragging contiguous files. A careful inspection of the two figures will show that after defragging the files are “contiguous” in both cases, yet they are not located in exactly the same sectors. The figures were drawn this way in order to illustrate that there is nothing special about which sectors contain the files. Any sectors will do, as long as they are physically adjacent to one another.

Before leaving the topic of linked files, one additional question needs to be addressed: how is the first sector of each file determined? As we have seen, once the disk drive begins reading a linked file, the current sector will always contain the location of the next sector of the file. But this leaves open the question of how the initial sector is located in the first place. Somehow the system must know where each file begins. Furthermore, this is true regardless of whether the file is linked or contiguous.

INSERT FIGURE 3.8

Figure 3.8: A collection of linked files after defragging

INSERT FIGURE 3.9

Figure 3.9: A Root Directory Table for the files of Figure 3.8

This problem is usually resolved by having a special location on the disk (such as Track A, Sector 1) set aside to hold a Root Directory. The Root Directory is a table that records the name and initial sector of every file that resides on a disk. This table is often loaded into the computer’s memory (RAM) when the machine is booted or the disk is first accessed. Given this table, it is a simple matter to retrieve any file since its initial track and sector are known. Figure 3.9 presents a root directory table that could be used for the disk of Figure 3.8.3

As an interesting side note, you may have heard that sometimes files can be recovered even after they have been “deleted” from a disk. The reason for this is that when an operating system erases a file, it deletes the file’s entry from the root directory and marks as “free” the sectors that had been devoted to storing the file. This is done in order to save time. It is much easier (and hence faster) to simply change one entry in a table, rather than overwriting the contents of every block of a file with “blanks”, or some other character. Data recovery programs scan through the entire disk a sector at a time retrieving the data stored in sectors that have been marked “free”. In order to truly delete sensitive data, a disk drive should be reformatted and overwritten with new data.

3.3.3 Indexed storage

The final type of file storage technique we will examine is indexed storage. Under indexed storage, the first sector of a file contains an index table that holds the track and sector number of all other sectors of the file. In order to read an indexed file; this first sector must be retrieved – by referring to the file’s entry in the root directory. The index table is then used to determine the locations of the other sectors that comprise the file. These sectors can then be retrieved one at a time. With indexed storage there is no need for the sectors that actually hold the record blocks to also contain “next sector” addresses.

While it should be clear that indexed storage would work – in the sense that organizing data in this way would provide all of the required information to allow files to be stored and retrieved – it would probably not be used very often if it had no clear advantage over linked storage. This is true because indexed storage suffers from the same retrieval time problems as linked storage and has the additional disadvantage of requiring an extra data block to hold the file’s index table.

However, indexed storage is in fact very popular because in addition to supporting the sequential access provided by both contiguous and linked file storage techniques it also allows for fast access to the individual records of a file based on a “key”.

Most information retrieval transactions require that a particular record be retrieved (e.g., an employee record, or a student record) given that record’s key (e.g., an employee name or student ID number). With either contiguous or linked storage the only way to get a particular record, such as the record with key 55555519, is to walk through the file one sector at a time until the block with that record is encountered. If the file were long, say a 10,000 record student file, this process could be very time consuming.

INSERT FIGURE 3.10

Figure 3.10: An illustration of the indexed file storage technique

Indexed retrieval is able to overcome this problem in the following way. A key is associated with each record of the file and the individual records are ordered according to those keys. For example, a file of student records could be sorted based on a student ID key. Records are blocked together in the standard way and written onto disk a sector at a time. However, a copy of the key to the last record in a block will be stored in the index table along with the track and sector address of that block.

The index table of a student file composed of 20 records, sorted by student ID number, and stored four records per block is illustrated in figure 3.10. Once the Root Directory has been loaded into memory, only two read operations are required to retrieve any particular record from the students file: one read to retrieve the file’s index table, and a second read to retrieve the file block containing the record of interest.

The advantage of indexed storage can be seen by examining the number of reads required to access the record of student number 55555519. Under either linked or contiguous storage, all five sectors of the file must be read in order to retrieve this record, compared to the two reads required of indexed. Of course, with indexed storage a program will have to scan the index table once it has been retrieved from disk, in order to determine which sector contains the desired record, and this will take some time. But since computers can perform billions of instruction per second, this time will be negligible in comparison to the disk access time.

The larger the size of a file, the greater the savings realized by using the indexed storage technique. Imagine a file of 10,000 student records stored 10 records per block. Such a file would require 1,000 sectors (and a disk much larger than the one we have been using in our illustrations). In order to retrieve the last record in the file, 1,000 individual read operations would be required when using either linked or contiguous storage, as opposed to only two operations when using the indexed approach. Hence, we would expect indexed retrieval of this particular record to take place approximately 500 times faster than linked access.

To summarize, contiguous storage is the simplest and fastest technique for reading an entire file in sequential order due to the fact that all of the records of the files are stored in physically adjacent sectors minimizing the seek and rotational latencies. This advantage is offset by the need to perform costly defragmentation operations in order to keep free sectors grouped together. Both linked and indexed storage allow all of the sectors of a drive to be used for storing files since they do not require that the data blocks making up a file be physically located in adjacent sectors. Linked storage is somewhat more space efficient than indexed, since indexed requires one additional data block per file to hold the file’s index table. The primary advantage of indexed storage over linked and contiguous is that it enables access to individual records in a file using only two read operations. To improve retrieval times when reading linked and indexed files sequentially defragmentation can be performed.

3.4 The programmer level

The previous two sections of this chapter described the low-level details of how files may be stored on disk. Normally, humans do not have to concern themselves directly with such issues since they are handled by the operating system program (in the case of file organization issues) or the disk drive controller (in the case of hardware-oriented issues).

This section provides an overview of how data can be accessed by the computer programs written by application programmers. Application programmers are the people who develop the programs run by end-users. Hence, this section is concerned with the lowest level of data manipulation that is commonly performed by humans.

While details of the software development process are deferred until Part III of this book, a brief summary of the file-oriented commands that allow computer programs to access, create, and modify data files are presented. A small example program that illustrates how these commands can be used to solve a simple problem is also included, along with a few words about how the operating system supports the programmer’s view of data.

In order to prevent the present discussion from becoming bogged down in irrelevant details, the “programming language” used in this section is a generic language that illustrates features common to most general-purpose programming languages. While every general-purpose procedural programming language, such as Java and C++, will include commands similar to those presented here, the exact syntax and semantics of those commands will vary from language to language.

3.4.1 File access commands

There are six basic commands used by programmers to manipulate data files: three to gain or release access to the file itself (create, open, and close), two to transfer data out of and into the file (read and write), and one to test whether all of the records have been processed (end_of_file). Given these commands, along with the other standard components of a programming language, it is possible to write computer programs to answer any particular question about the data stored in a file.

The open command is used to gain access to a file that currently exists. Every file maintained by the operating system is identified by a name and location. The location of a file is normally specified by providing a “path” to the directory that contains the file. For example, a file named “Students” located in the “UniversityData” directory on the machine’s “C” drive, could be referred to as:

C:\UniversityData\Students

Since referring to a file by its full path and name can be somewhat cumbersome, an “internal” program name is usually associated with the “external” operating system path and name. The format of the open command is:

Open (program file name, operating system file name, access type);

For example,

Open ( InFile, “C:\UniversityData\Students”, read_only );

is a valid command that will associate the name “InFile” with the aforementioned “Students” file located in the “UniversityData” directory of the “C” drive and then open this file for “read_only” access.

A file’s “access type” specifies the input / output operations that the program will be allowed to perform on the file. Two of the most common access types are “read_only” and “read_write”. Programs with “read _only” access to a file are permitted to read the contents of the file but are not allowed to modify the file in any way. Programs with “read_write” access to a file can read the records already stored in the file and also have permission to write new records to it. The open command given above, would allow a program to read the records stored in the “Students” file but not to modify the file in any way.

Attempting to open a file that does not exist results in a “run-time exception”, or error.

For example, if the operating system could not find the “Students” file in the “UniversityData” directory on the “C” drive when attempting to execute the above command, then “InFile” could not be opened and an error condition would occur.

The two most common reasons for this error are: (1) the programmer mistyped the path or name of the file, or (2) someone accidentally deleted or renamed the file.

The create command is very similar to the open command, except that it is used to construct a new file, rather than to gain access to an existing one. The create command has the format:

Create (program file name, operating system file name, access type);

The statement:

Create ( OutFile, “C:\NamesOfAllCSMajors”, read_write );

instructs the operating system to create a new file, to be located on the “C” drive and named “NamesOfAllCSMajors”. This file will be called “OutFile” within the program, and its access permission is set to “read_write” to allow the program to add records to the file. While it is technically possible to create a file with an access type of read_only, such a move would make little sense, since it would be impossible for the program to place any records into that file.

Attempting to create a file that already exists results in a runtime exception for the program. This is a safety feature which helps prevent a program from accidentally wiping out an existing file by writing a new file over it.

The last of the three commands for gaining and releasing access to a file is the close command. This command is a “housekeeping” command that releases all of the operating system resources, such as memory, that were used to access the file. The format of the close command is:

Close (program file name);

The following command could be used to close the “Students” file, which was associated with the name “InFile” in the open command:

Close (InFile);

Transfer of data between disk files and internal program variables is accomplished with the read and write commands. In the case of simple sequential files, the read command accesses records one after another in the order they appear in the file, until the end of the file is reached. Similarly for sequential files, the write command appends new records onto the end of a file. In a sequential file, records can only be placed at the end of the file and never between existing records.

The formats of the read and write commands are:

Read (program file name, variable);
Write (program file name, variable);

The first argument is the program name for the file and the second argument is the variable which data is transferred to or from. For example, suppose we had an internal variable named StudentRecord that was patterned after the student records of Figure 3.3. A record from the “Students” file could be transferred to this variable by the following command, assuming that “Students” had been associated with the name “InFile”:

Read (InFile, StudentRecord);

Similarly, a student name, stored in the program variable StudentRecord.Name, could be written to the output file, “OutFile”, with the following command.

Write (OutFile, StudentRecord.Name);

The final file manipulation command we will look at is a command to test whether the end of file has been reached. The format of this command is:

End_of_file (program file name);

This command either returns a value of TRUE (indicating the end of the file has been reached) or FALSE (indicating that the end of file has not yet been reached).

3.4.2 A simple example

In order to illustrate how the six file access commands presented above can be used to solve a simple information retrieval problem, a small program is developed in this section. This program determines the names of all students majoring in computer science by accessing a file of student records organized in the manner presented in Figure 3.3

The program will work in the following way. First, the file containing student records will be opened. Second, a file, which will eventually contain the names of all CS majors, is created. Next, the program will begin to read records one at a time from the student file, stopping when the end of that file is encountered. As each student record is read, its “major” field will be tested to determine whether it is equal to “CS”. If so, the “name” field of the current student record will be written to the CS majors’ name file. Once all of the records in the student file have been examined, both the student file and the CS Majors’ name file will be closed.

The program’s intended behavior can be expressed by the following “pseudo code”.

Open the student records file
Create a new file to hold the names of all CS majors
While we haven't reached the end of the student records file
Read the next student record from the student records file
If the current student’s major is “CS” then

Write that student’s name in the names file

After examining all of the student records, close both files

Figure 3.11 contains the actual program command sequence. Note that the program opens the input file in read_only mode since the original “Students” file will not be modified in any way. The output file is created in read_write mode, since the program will modify it. The While loop and If statements are discussed in Part III of this text, but their intent should be clear from the above English and pseudo code descriptions.

Open (InFile, “C:\UniversityData\Students”, read_only);

Create (OutFile, “C:\NamesOfAllCSMajors”, read_write);

While Not (End_of_file (InFile))

{

Read (InFile, StudentRecord);

If (StudentRecord.Major == “CS”)

Write (OutFile, StudentRecord.Name );

}

Close (OutFile);

Close (InFile);

Figure 3.11: A portion of a program to generate a list of the names of all CS majors

Before the widespread adoption of relational databases, computer programs, such as the one above, had to be written and debugged in order to answer almost any question about the data maintained by an organization. While this was often more expedient than trying to answer the question by going through the data by hand, it was, nevertheless, far from an optimal solution. Section 3.5.2 examines relational databases, the end-user application that has overcome many of the problems associated with the storage and retrieval of large quantities of data.

3.4.3 Buffering and direct access

The discussion of data access at the hardware level described how disk I/O occurs in fixed size blocks, a sector at a time. The discussion of the file organization level explained the need to group the individual records of a file into blocks, in order to efficiently use disk space. The present discussion of data access at the programmer level has, so far, ignored blocks and sectors.

Why? The reason for this is that file blocks and disk sectors are essentially “invisible” at the programmer level.

Application programs access files on the record level. As illustrated above, they read and write data a record at a time. This “record at a time” access is possible because the operating system automatically blocks and unblocks records for the program using an area of memory called a buffer.

When an application program first executes a read command, the operating system will send a message to the disk drive controller requesting the track and sector containing the first record be read. The contents of this data block will then be copied into a memory buffer. The operating system removes the first record from the buffer and passes it to the application program. When the program executes its next read command, the operating system will first check the buffer to see whether it contains the requested record. If so that record is removed from the buffer and passed on to the program. This process continues until the buffer becomes empty, at which point the operating system reads the next data block of the file from disk.

Writing records to a file works in much the same way. Records are initially “written” to a buffer. Once that buffer becomes full, the operating system creates a new buffer for the program and then writes out the contents of the old buffer to an appropriate track and sector of the disk.

The operating system hides all of the details about blocks, buffers, and tracks and sectors from the application program. This allows the application programmer to think of data strictly in terms of files, records, and fields.

Finally, a few words should be said about direct file access. When we studied the indexed method of file storage we noted that it could be used to retrieve individual records based on a key field. Operating systems that support this type of file organization technique generally allow “direct access” files to be accessed, modified, and created at the application program level. The read and write statements associated with direct access files require a “key” be specified. When a read is executed, the record matching the specified key is returned to the program, assuming such a record exists in the file. Otherwise, an error message is sent to the program.

As with sequential files, the operating system handles all of the details of blocking and unblocking records. In the case of direct access files, however, the operating system must perform a number of additional tasks. For example, when a read is executed, the target file’s index table is searched to determine the address of the track and sector containing the block where the record should be located. That block is then retrieved and placed in a memory buffer. Next, the buffer is searched for the requested record. If the record is found, the operating system returns it to the program. If the record cannot be found, an error message is sent to the program, indicating the record is not in the file.

Again, all of these details are hidden from the application programmer. From the point of view of the application program, a record with a particular key is requested and the record either shows up or a message indicating that it could not be found is returned.

3.5 The end-user level

This section addresses how data appears to the end-user of a computer system. End-users generally access data in two ways: through the operating system interface, and through end-user applications. The first part of this section describes how end-users can access and organize data via the operating system. The second part examines, in some detail, an application designed specifically for manipulating large quantities of data, the relational database.

3.5.1 The operating system interface

Most operating systems implement the concept of a file, or document. A file is a collection of data stored under a unique name. Files may be grouped together into directories, or folders. A directory is a collection of related files and subdirectories. The file system is the basic tool for organizing the programs and data stored in a computer. It allows people to place related pieces of information “near” one another.

An example directory structure is outlined in Figure 3.12. Files are indicated by “document” icons, and directories by “file folder” icons. The data is organized into a single main directory “letters”. Letters is subdivided into two subdirectories: “personal” letters and “business” letters. Personal letters includes letters to “Mom” and to “Pat,” your significant other, plus one letter to your Cousin It. Note that “CousinIt” is a file, or document, containing a single letter, while “Pat” is a directory containing multiple letters. If you were planning to write more than one letter to Cousin It, you would probably make a separate directory to hold these letters, as was done for Mom.

INSERT FIGURE 13.12

Figure 3.12: Overview of a file system

It is important to understand that while a file system is a tool for organizing data, it does not do the organizing for you. A human must decide which directories to create, what to name them, and what files and subdirectories belong in them. In a well-organized file system, the directories will contain files and subdirectories that are logically related to one another. Properly used, a file system can make it much easier to find and retrieve important data; improperly used, it can make the task very difficult.

The user interface is another major characteristic of operating systems. From the view of the human user, the operating system’s user interface defines the basic behavior of the computer. Most modern operating systems, such as Apple’s OS X, Microsoft’s Windows XP, and Linux include a graphical user interface, or GUI (pronounced “gooey”). A GUI uses small graphical images, called icons, to represent programs, data files, directories, and other resources. In Microsoft Windows, double-clicking the mouse pointer on the icon for a directory or disk drive opens a window that will display the contents of that directory or drive. From this window, the user can double-click on a subdirectory icon to open it. GUI’s make it relatively easy to navigate any reasonably well-organized file system simply by “clicking” your way through it.

Once a particular file has been located, we generally want to view, and perhaps modify, its contents. In order for the contents of a file to be displayed in a meaningful manner, the file must be accessed by an appropriate application. Modern operating systems associate files of a particular type with application programs capable of displaying and manipulating data of that type. Double-clicking on a file’s icon causes an appropriate program to be launched and the file’s data to be loaded into that program.

For example, if a file contains an image in JPEG format, then that file must be opened by a program capable of viewing JPEG images. A file containing a text-based document, such as a memo or report, must be opened by a different kind of program, a word processor. Associating particular application programs with the file types they recognize, enables the operating system to launch an appropriate program when a file’s icon is double-clicked. Figure 3.13 shows what my computer’s desktop looks like after double-clicking the icon for “component_view.pdf” which is a Portable Document Format file associated with Adobe’s Acrobat Reader program. Windows launches Acrobat Reader and then automatically loads the data file “component_view.pdf”.

INSERT FIGURE 3.13

Figure 3.13: A PDF file viewed with Adobe’s Acrobat Reader

In addition to navigating the file system and launching applications, the operating system also supports a number of file system “maintenance” operations. The end-user accessible file system operations include:

Deleting a file
Creating a copy of an existing file
Moving a file to a new location
Creating a new directory
Removing an existing directory

Note that creating a new file is not included in this list – even though that operation is supported by the operating system. The reason for omitting file creation from the list of end-user accessible file system operations is that file creation is performed by programs – not end-users directly interacting with the operating system. (If this distinction doesn’t make any sense to you, think about it this way, you create files on your PC by using the “Save” option of an application – not by some drag-and-drop desktop command or system menu.)

In most modern operating systems each of the end-user accessible operations can be performed in two ways: either graphically by selecting options with the mouse and dragging icons from one screen location to another, or via typed command. Communication with older operating systems, such as DOS and early versions of Unix, was limited to typed commands. This style of interface, called a “command line interface,” is no longer popular, since typing is slower than pointing and it is often difficult to remember the names and exact form of all of the commands understood by the operating system.

In the Microsoft Windows family of operating systems, a file can be deleted by dragging its icon over to the “Recycle Bin” icon and dropping it there. A directory can be deleted in the same way – by dropping its icon over the “Recycle Bin”. Occasionally the recycle bin needs to be emptied by right-clicking the mouse over it and selecting “Empty Recycle Bin”.

Files are moved from one directory to another in Windows by dragging the file’s icon from one directory’s window to the window associated with another directory. Copying a file works similarly, only in this case the control key, “ctrl”, should be held down on the keyboard while the icon is dragged from one window to another. Copying can also be performed by right-clicking the file’s icon, selecting “Copy” from the popup menu, then selecting “Paste” from the “Edit” menu of the target directory’s window.

Creating a new directory is accomplished by selecting “New” and then “Folder” under the “File” menu of the window associated with the directory where this new directory is to be located. When created, the new directory will be named “New Folder”, but can be changed by simply typing the desired name and hitting the enter key.

Each of these operations can also be performed by entering one of the following text-based commands.

Deleting a file
del filename
Creating a copy of an existing file
copy from_filename to_filename
Moving a file to a new location
move from_filename to_filename
Creating a new directory
mkdir directoryname
Removing an existing directory
rmdir directoryname

These kinds of file system commands are supported by every general-purpose operating system – be it Apple’s OS X, Linux, Free BSD, Windows, or DOS. What varies from operating system to operating system is the exact form of these commands and the details of how one interacts with the graphical user interface to issue the commands – assuming a graphical user interface is supported, DOS for example has no GUI.

The popularity of particular operating systems rise and fall over time. In days gone by, DOS, Mac OS, and Window 3.1 ruled the PC market. Today, DOS is pretty much a dead OS. The current Windows family of operating systems (i.e., Windows XP, Windows 7, and Windows 8) bears little resemblance to Windows 3.1. While Apple’s OS X interface is a clear descendent of the original Mac OS, it too incorporates substantial changes. Given this history, it would be surprising if a decade from now operating systems haven’t changed substantially from their current form.

Although it is certainly important for you to understand how to interact with the particular operating system currently run by your computer, it is equally important that you understand the kinds of operations supported by all existing operating systems – especially since these operations are likely to be supported by future operating systems for years to come. While the details of how these operations are performed do and will vary from system to system, knowing what to expect will accelerate the learning process.

3.5.2 Relational Database systems

Database programs, such as Microsoft Access, are popular end-user applications that allow people to organize large quantities of data in ways that make it easy to answer questions about that data.

For example, a database for a university should make it easy to answer questions such as:

Who is currently majoring in computer science?
What are the names and student ID numbers of the students who took Dr. Carpenter’s offering of Math 241 in the spring of 2013?
Who taught CS 100 in the fall of 2012?
Where is Dr. O’Neal’s office located?

INSERT FIGURE 3.14

Figure 3.14: The “Faculty” relation – part of a university database

In order to answer these kinds of questions, a university database must maintain information on faculty, students, courses offered, and grades earned.

3.5.2.1 Relations, attributes, and tuples

All modern database programs are based on the relational database model, which organizes logically related data into tables. These tables, which are known as relations, are subdivided into rows and columns. Each column, or attribute, of a relation keeps up with a particular kind of data. Each row, or tuple, of a relation holds all of the data about a particular entity or event.

Figure 3.14 contains the “Faculty” relation from a hypothetical university database. This relation (table) has six attributes (columns): “Fname” which stands for the faculty member’s name, “Dept” the department in which the faculty member serves, “Office” the building code and room number where the faculty member’s office is located, “Phone” the faculty member’s phone number, “SSN” the faculty member’s social security number, and “Salary” the faculty member’s nine-month academic-year salary. In addition to these six attributes, the “Faculty” relation contains seven tuples (rows).

Each attribute of Figure 3.14 holds a particular type of information, like phone numbers or salaries. Each tuple contains all of the different kinds of information known about a particular individual.

Figure 3.15 presents another relation from our university database. This relation is the “Students” relation, which keeps up with all of the various kinds of information maintained about students. The “Students” relation consists of eight attributes: “Sname”, “Age”, “Major”, “ID”, “Sex”, “Address”, “City”, and “State”; and twenty tuples, one for each student attending this (quite small) university.

INSERT FIGURE 3.15

Figure 3.15: the “Students” relation – part of a university database

By now you may have noticed the similarities between files with their records and fields, and relations with their tuples and attributes. In fact, the tuples of a relation are quite similar to the records of a file, and attributes are analogous fields. However, you should not assume that every relation in a database will actually be stored as a separate file on a disk drive. In fact, you don’t even need to worry about files and records when using a relational database, since the application will handle all of those details for you. Instead, try to think in terms of tables (relations) with their rows (tuples) and columns (attributes).

Now that we have some idea of how data is organized in the relational model, the obvious question is: “How can relational databases be used to answer questions about the data contained in the relations?”.

The relational model provides three query operators:

Select
Project
Join

A query may be thought of as a question posed to the database. Query operators act on existing relations and always produce new relations as output.

3.5.2.2 The “select” operator

The select operator is used to retrieve particular rows, or tuples, from a relation that satisfy some test. For example, let’s say you wanted the complete records for every student majoring in computer science. This could be accomplished with the following query:

Select from Students where Major = “CS”

The relation resulting from this query is presented in Figure 3.16. Notice that this new relation contains the same attributes as its parent relation, “Students”. However, only the tuples that satisfied the condition Major = “CS” are reproduced.

In this text, select queries will be specified using the following general format:

Select from relation where attributeoperatorvalue

The fields “relation” and “attribute” stand for the names of actual relations and attributes. The field “value” will be replaced by either a number or string of characters; depending on the type of data stored in the named attribute.

The “operator” field will be replaced by a comparison operator, such as:

< less than
≤ less than or equal to

> greater than
≥ greater than or equal to

= equal to

The last three fields of the select command (attribute operator value) form a condition test. This test will be either unambiguously true or false for each tuple of the relation being queried. Those tuples for which the condition is true are copied to the resulting answer relation. Tuples for which the condition is false are not copied.

INSERT FIGURE 3.16

Figure 3.16: Results of Select from Students where Major = “CS”

When comparing a number (numeric value) to an attribute like “age” that also contains numbers, the comparison operators work in the expected way. For example, 15 < 17 is true, while 28 > 28 is false. Comparisons involving character strings are made relative to lexicographic, or dictionary, order. Hence, “A” comes before “B”, which comes before “C”, etc. Under this scheme, “<” is interpreted to mean “precedes” and “>” means “follows”, while “=” means “the same as”. So, “A” < “B” is true, as would be “A” < “Apple”. The expression “B” > “C” would be false.

While this approach for handling character strings certainly makes sense, it sometimes takes a bit of getting used to. For example, given a relation that contains grade information represented as “A”, “B”, “C”, etc., the condition to select all grades higher than “C” would be

Grades < “C”

not Grades > “C” due to the fact that comparisons are done using lexicographic order where “A” and “B” precede “C”.

One final point about “select”, which really applies to all relational operators. Since “select” produces a relation, it is often useful to give this relation a name so that it can be referred to elsewhere. The format used in this book for doing so is:

Relation ← relational_query

For example, the relational expression:

Louisiana_Students ← Select from Students where State = “LA”

creates a table named “Louisiana_Students” of all students who reside in Louisiana. The Louisiana_Students relation could be used in other queries, such as the following, which generates a table of male students who live in Louisiana:

Male_Louisiana_Students ← Select from Louisiana_Students where Sex = “M”

In summary, the “select” operator creates a new relation that consists of selected rows, or tuples, of an existing relation. The rows are selected based on the outcome of a condition test. Remember to use the “select” operator when you are interested in retrieving “complete records” about a subset of the entries stored in a relation.

Exercises for Section 3.5.2.2

Using the relations presented in Figures 3.14 and 3.15, develop relational queries to solve each of the following problems.

Generate a relation, based on the Students relation of Figure 3.15, that contains entries for all students under age 21.
Generate a relation, based on the Faculty relation of Figure 3.14, that contains entries for all faculty members in the Math department.
Generate a relation, based on the Students relation of Figure 3.15, that contains entries for all female CS students.
Generate a relation, based on the Students relation of Figure 3.15, that contains entries for all male CS math majors who are between the ages of 19 and 23, inclusive.

3.5.2.3 The “project” operator

Sometimes we are interested in retrieving particular columns, or attributes, of a table, rather than retrieving particular rows, or tuples. Perhaps a list of faculty names and phone numbers is needed; or a list of student names and addresses.

The project operator4 is used to retrieve particular attributes, or columns, from a relation. For example,

Project Fname, Phone from Faculty

would produce the relation shown in Figure 3.17. This newly generated relation contains an entry for each row, or tuple, in its parent relation, “Faculty”. However, only the specified attributes, namely “Fname” and “Phone”, are included.

The general format used in this book for the “project” query is:

Project attribute_list from relation

where the field “relation” will be replaced with the name of an existing relation, and “attribute_list” will be replaced by a comma separated list of attribute names.

Figure 3.18 presents another relation from our university database. This relation, the “Courses” relation, contains information about the university’s course offerings. The relation would contain an entry for every section of every course ever offered. Each row, or tuple, of this relation includes the name of the faculty member who taught the course, the sequence number of the particular section, the name of the course, the quarter and year offered, and the number of credit hours the course is worth.

INSERT FIGURE 3.17

Figure 3.17: Results of Project Fname, Phone from Faculty

INSERT FIGURE 3.18

Figure 3.18: The “Courses” relation – part of a university database

Given the “Courses” relation, a list of course names and the credit hours they are worth can be generated by the following relational query:

Project Course, Credits from Courses

The results of this query are presented in Figure 3.19.

Note that the relation of Figure 3.19 contains only eight tuples, whereas there are eleven tuples in its parent relation, “Courses”. The reason for this discrepancy is that duplicate entries have been eliminated. Relations are generally considered to “sets” of tuples, and as such, duplicate entries are not allowed. “Project” operations often generate tables that contain some duplicate rows, but most relational database packages will delete these duplicates automatically.

INSERT FIGURE 3.19

Figure 3.19: Results of Project Course, Credits from Courses

As with “select”, the relation produced by a “project” operator can be assigned a name so that it can be referred to in subsequent relational operations. For example,

Names_and_addresses ← Project Sname, Address, City, State from Students

would place student names and address information (street address, city, and state) into a relation named “Names_and_addresses”.

In summary, the “project” operator creates a new relation that consists of a subset of the columns, or attributes, of an existing relation. Every tuple of the parent relation will contribute to the projected relation, but duplicate rows will be deleted. Unlike “select”, there is no condition test associated with “project”. Remember to use the “project” operator when you are interested in retrieving particular fields of a relation.

Exercises for Section 3.5.2.3

Using the relations presented in Figures 3.14, 3.15, and 3.18, develop relational queries to solve each of the following problems.

Generate a list of the names of all students enrolled in the university.
Generate a list of the names of all courses taught at the university.
Generate a list of the names of all faculty employed by the university.

3.5.2.4 Queries involving both “select” and “project”

At the beginning of this section a number of questions were posed, typical of those that might occur in a university environment. These questions included:

Who is currently majoring in computer science?
Who taught CS 100 in the fall of 2012?
Where is Dr. O’Neal’s office located?

It was then claimed that a database system should make it possible to easily answer these questions.

So far we have discussed two of the three relation database query operators: “select” and “project”. These two operators can be used together to construct tables that answer very specific questions about the data stored in a relation.

For example, to answer the question “Who is currently majoring in computer science?”, we can construct a relation that contains the names of all CS majors. All that is required is the following sequence of two relational queries:

CS_Majors ← Select from Students where Major = “CS”

Names_of_CS_Majors ← Project Sname from CS_Majors

The first command creates a table, named CS_Majors, that is based on the “Students” relation but only contains those tuples that contain information about CS majors. The second statement projects the name attribute from the “CS_Majors” relation, producing a table that contains only the names of CS majors.

It is interesting to note that during the discussion of the programmer’s view of data in Section 3.4, we constructed a program that essentially answered the same question. That program, presented in Figure 3.11, is ten lines long and requires knowledge of file systems and programming concepts in order to be developed.5

The other questions listed above can be answer by constructing a series of relational commands that use both “select” and “project”. For example, “Who taught CS 100 in the Fall of 2012?” can be answered with the following sequence of commands:

CS100 ← Select from Courses where Course = “CS100”

CS100_2012 ← Select from CS100 where Year = 2012

CS100_Fall_2012 ← Select from CS100_2012 where Quarter = “Fall”

CS100_Fall_2012_Instructors ← Project Fname from CS100_Fall_2012

The first “select” gets information on CS 100 courses. The second “select” narrows this down to CS 100 courses taught in 2012. The third “select” narrows the data even further to the sections of CS 100 taught during the fall quarter of 2012. The final statement projects only the faculty members’ names, so that the resulting relation answers the original question: “Who taught CS 100 in the Fall of 2012?”. For our example university database, the final relation consists of a single entry “ONeal M.B.” Given different data, the final table might have contained multiple entries, since it is possible that a number of different instructors could have taught separate sections of CS 100 in the fall of 2012.

While the above example is easy to understand, writing separate “select” statements for every attribute condition we need to express can become cumbersome. Many database programs allow you to “and” together a number of different condition tests, where all of the tests must yield true for a tuple to be selected. The first three statements of the above example can be reduced to a single expression by using such a “compound select”. Hence, the question of who taught CS 100 in the Fall of 2012 could be determined by the following two relational expressions.

CS100_Fall_2012 ← Select from Courses where Course = “CS100” and
Year = 2012 and
Quarter = “Fall”
CS100_Fall_2012_Instructors ← Project Fname from CS100_Fall_2012

You may have noticed that all of the examples in this subsection applied the “select” operator before the “project”. A common error made by students who are just learning to write queries that involve multiple relational expressions, is to perform the “project” too early, or to leave out necessary attributes when crafting a needed projection.

For example, let’s say that in trying to determine the names of CS 100 Fall 2012 instructors, I began by first making a list of instructor names and then tried to narrow the list to those instructors who taught CS 100 in the Fall of 2012. In other words, I took the following approach:

Instructors ← Project Fname from Courses

CS100_Fall_2012_Instructors ← Select from Instructors where ???????

The problem that I would encounter is that the table with the instructor names does not contain the proper fields to allow me to narrow the list any further – it is just a list of names and nothing more. However, you should not conclude that placing “projects” before “selects” is necessarily wrong. The following sequence of relational expressions, which does begin with a “project” statement, is an acceptable, albeit somewhat longer, solution to the problem.

Course_Data ← Project Fname, Course, Quarter, Year from Courses

CS100_Fall_2012 ← Select from Course_Data where Course = “CS100” and

Year = 2012 and

Quarter = “Fall”

CS100_Fall_2012_Instructors ← Project Fname from CS100_Fall_2012

This solution concentrates first on determining the attributes needed to solve the problem, rather than the tuples needed to do so. Such an approach makes a lot of sense when dealing with tables that contain large numbers of attributes, since it helps us to focus on which parts of a relation are relevant to the question being answered.

Exercises for Section 3.5.2.4

Using the relations presented in Figures 3.14, 3.15, and 3.18, develop relational queries to solve each of the following problems.

Where is Dr. O’Neal’s office located?
What course did Dr. Kurtz teach in fall 2013?
What are the names of all male math majors who are between the ages of 19 and 21, inclusive?
What are the names and telephone numbers of faculty in the math department?
List the student numbers (but no other information) of female CS students.
List the salaries of computer science faculty members (but, to protect confidentiality, don’t include any other information in the table).

3.5.2.5 The “join” operator

The final relational query operator is “join”. The join operator is used to combine information from multiple tables together into a single table, based on some attribute that the tables share in common. The number of columns (attributes) in the resulting table will always be the sum of the number of attributes in both tables minus one. The reason for this is that all attributes in the two source tables will be included in the resulting table, but since one of the attributes is shared, it will only be listed once.

The number of rows contained in the joined table will vary depending on the type of join and the data present in the tables. The most common type of join is an “Equijoin”. In an equijoin, the shared attribute in one row (tuple) of the first table must exactly match the same attribute in some row of the second table in order for a tuple (formed from these two rows) to be included in the resulting table.

An example will make this clear. Consider the two tables, Alpha and Beta, shown in Figure 3.20. These two tables share attribute “Z” in common and so may be joined. Since Alpha has three columns and Beta also has three columns, the resulting table will have five columns (3 + 3 – 1). Every tuple of Alpha will be compared to every tuple of Beta. Assuming we are performing an equijoin, those tuples in which Alpha.Z equals Beta.Z will be joined together to form a tuple in the resulting relation. So, for example, since the “Z” in row one of Alpha matches the “Z” in row one of Beta, they will form a tuple in the combined relation. Row one of Alpha with row two of Beta also forms a tuple. As does row three of Alpha with row four of Beta. Since no other tuples satisfy the specified condition, the relation resulting from this join contains only these three tuples.

INSERT FIGURE 3.20

Figure 3.20: A join of relations Alpha and Beta over attribute “Z”

The join of Alpha and Beta over equal Z values could be expressed as :

Join Alpha and Beta where Alpha.Z = Beta.Z

Further, if we agree that whenever we use the word “join” we really mean “equijoin” the expression can be simplified to:

Join Alpha and Beta over Z

The general format used to specify joins in the remainder of this chapter is:

Join relation1 and relation2 over shared_attribute

where the fields “relation1” and “relation2” will be replaced by the names of existing relations and “shared_attribute” will be replaced by the name of the attribute they share in common.

3.5.2.6 Queries involving “select”, “project”, and “join”

The power of join really becomes apparent when it is used with the other two relational operators: select and project. For example, let’s say we want to produce a listing of the instructor contact information for every course offered in the spring of 2013. The listing should include for each course: the course name, section sequence number, instructor’s name, office location, and phone number.

Reviewing the three relations that make up our database: Faculty (Figure 3.14), Students (Figure 3.15), and Courses (Figure 3.18); we see that instructor names, office locations, and phone numbers are stored in Faculty, while course names and sequence numbers are stored in Courses. In other words, the information we want is not stored in any single table of the database, but is instead spread among multiple tables. Upon further examination we might also note that the two relations, Faculty and Courses, share an attribute in common: faculty names. Because Faculty and Courses share a field in common, they may be joined. The relational expression to perform this operation is:

Rel1 ← Join Courses and Faculty over Fname

The relation produced by this expression, Rel1, will contain all of the attributes of both Courses and Faculty (with the exception that Fname will be listed only once). Hence, there would be 11 attributes in the resulting relation: Fname, Seq_no, Course, Quarter, Year, Credits, Dept, Office, Phone, SSN, and Salary. Due to the fact that the Fname field in each row of Courses will match up with one (and only one) Fname entry in the Faculty relation, Rel1 will contain exactly the same number of tuples as Courses, eleven.

INSERT FIGURE 3.21

Figure 3.21: A relation, Rel4, together with the expressions that generated it

Rel1 contains all of the information we are interested in, but it also contains a large number of rows and attributes we are not interested in – rows containing information about classes offered in quarters other than spring 2013, and attributes such as faculty social security numbers and salaries. Thus, we need to construct a relation that contains only the relevant data.

Reviewing the problem statement, we were asked to provide contact information on all classes offered during the spring of 2013. The following two select statements produce a relation with the relevant tuples.

Rel2 ← Select from Rel1 where Year = 2013

Rel3 ← Select from Rel2 where Quarter = “Spring”

Rel3 is nearer the mark since it does not contain any extraneous rows. However, it still contains a number of attributes that were not requested, such as the aforementioned Faculty social security numbers and salaries. The following project statement produces the final relation:

Rel4 ← Project Course, Seq_no, Fname, Office, Phone from Rel3

This sequence of relational expressions is summarized, and the resulting relation shown, in Figure 3.21.

When we began our discussion of relational databases we posed a number of example queries of the type that could be asked against a university database. One of those questions was:

What are the names and student ID numbers of the students who took Dr. Carpenter’s offering of Math 241 in the spring of 2013?

INSERT FIGURE 3.22

Figure 3.22: The “Grades” relation – part of a university database

If you attempt to answer this question using the Faculty, Students, and Courses tables, you will quickly discover that they do not contain the requested information. There is information on Dr. Carpenter and her Math 241 offering in the spring of 2013, but nowhere in the three tables are the courses actually taken by students specified.

The association between students and the courses they have taken is provided by the final relation of our university database, the Grades relation, which is shown in its entirety in Figure 3.22. This rather lengthy table contains three attributes: ID, Seq_no, and Grade; and has an entry for every course taken by every student. The ID attribute, which represents student ID numbers, is shared by both Grades and Students, and thus provides a link between those two tables. Likewise, a link is provided between Grades and Courses via the shared “Seq_no” field, which represents the sequence numbers of individual course offerings or sections.

Using the Grades relation it becomes possible to find the names and ID numbers of the students who took Dr. Carpenter’s spring 2013 offering of Math 241.

In general, when trying to answer a query involving information from a number of different tables (such as this one) it is important that you be fully aware of the types of information stored in each of the tables and aware of the fields that they share in common. The shared fields are critical, since they are used to “work your way” from table to table collecting the information you are interested in along the way.

For example, to discover the students who took a particular class, it is first necessary to obtain the sequence number of that class from the Courses relation. Once we have the course sequence number we can select the rows of the Grades relation that match that sequence number, giving us the ID numbers of the students who took the class. Using those student IDs, we can retrieve the names of the matching students from the Students relation.

This approach for determining the names and ID numbers of the students in Dr. Carpenter’s spring 2013 Math 241 course can be summarized in the following way:

Determine the course sequence number of the indicated course.
Determine the Student ID numbers of the people taking that course.
Retrieve the student information on each of those students.
Project out the name and student ID fields from those student records.

Step one can be accomplished via the following compound select statement:

R1 ← Select from Courses where Fname = “Carpenter J.” and
Course = “Math 241” and
Quarter = “Spring”and
Year = 2013

We now have a relation that contains a single tuple – the one concerning Jenna Carpenter’s spring 2013 offering of Math 241.6

Although relation R1 consists of the one tuple from Courses that concerns the course of interest, all of the fields, save one, are not really needed. This is because the only piece of information we need from Courses is the sequence number of Dr. Carpenter’s course, which is “100010”. To keep the relations we are working with from becoming cluttered with unnecessary fields, let’s project out this single attribute.

R2 ← Project Seq_no from R1

Given the course sequence number, we can now construct a relation containing only the grade records of the students who took the course. This can be accomplished by the following relational query:

R3 ← Join Grades and R2 over Seq_no

The contents of relation R3 is shown in Figure 3.23. As you can see, R3 consists of all of the grade records for the course with sequence number “100010”. While we needed the course sequence number to produce this table, the only information contained in the table that is currently of interest to us is the student ID numbers. Hence, we can project out this column:

R4 ← Project ID from R3

INSERT FIGURE 3.23

Figure 3.23: Relation R3 – the student numbers and grades of all students
who took Math 241 in the spring of 2013 under Dr. Carpenter

We can now begin the third step outlined above: retrieving the student records for the individuals whose ID numbers appear in R4. Doing so will involve a join of R4 with the Students relation over ID.

R5 ← Join Students and R4 over ID

Relation R5 contains the records of those students who took Dr. Carpenter’s Math 241 course in the Spring of 2013. The final step in solving the problem involves projecting out only those fields we are interested in; namely Sname and ID.

R6 ← Project Sname, ID from R5

The final table resulting from this series of relational expressions is shown in Figure 3.24.

Although the rows of this table are shown sorted by both last name and student ID number, relational database theory treats relations are unordered sets of tuples, so the order of the rows in the displayed table is technically unpredictable. Most commercially available database applications, however, allow the user to easily sort the rows of a table into any order he or she desires.

INSERT FIGURE 3.24

Figure 3.24: Relation R6 – The names and ID numbers of all students
who took Math 241 in the spring of 2013 under Dr. Carpenter

Exercises for Section 3.5.2.6

Using the relations presented in Figures 3.14, 3.15, 3.18, and 3.22, develop relational queries to solve each of the following problems.

Produce a listing of the grades given by Dr. O’Neal (“ONeal M. B.”) in all of his courses. The relation you produce should contain only the student names, their grades, the name of the course in which they made the grade, and the quarter and year of offering.
Create a relation containing only the name, office location, and phone number of the instructor who taught the course with sequence number “100004”.
Generate the fall 2012 course schedule for student number “55555510”. The schedule should consist of course names, sequence numbers, and instructors.
Produce an academic transcript for “Walker J.”. The transcript should consist of the course name, quarter, year, and grade earned for every course taken by Mr. Walker.
The Chair of the CS program is interested in how well his students are doing. Generate a table of student names, courses taken, and grades earned by every CS major in winter 2013.
Produce a list of the names of students who earned one or more A’s in winter 2013.
What is the SSN of the only professor who gave “Kleinpeter J.” a B in fall 2012?

Footnotes

1A byte is eight bits. Two of these bits could be used to indicate which track: 00 for A, 01 for B, and 10 for C. Four bits could be used to represent the sector number as a standard binary number. Binary numbers are discussed in detail in Chapter 11.

2Except, of course, for the very last sector of each file which will still contain an “end of file” indicator.

3Note that in reality the root directory is just one part of an overall file system, such as FAT32 or NTFS, that serves to organize the disk in order to enable file access and manage space that has not yet been allocated.

4Pronounced “pro -ject”, as in “projector”.

5Actually, Figure 3.11 does not contain a “complete” program, only the main parts of one. The entire program would be substantially longer.

6If Dr. Carpenter had taught multiple sections of Math 241 during the Spring 2013 quarter, multiple tuples would be listed in this relation – one for each section. However, the wording of the original question makes it unlikely that she was teaching multiple sections of that class that quarter.

Return to top

DATABASE LAB