Sometimes when you solve a problem, it’s helpful to take some notes as you go: book-keeeping. For example, if you had to find a path through a picture of a maze, you might use your pencil to draw some paths as you worked out the problem. If you were solving a Sudoku puzzle, you would mark down some possible numbers in each cell.
A data structure allows information to be organized and stored. Sometimes data structures are used to organize important data (for example, a collection of names and associated phone numbers), and sometimes data structures are used for book-keeping as an algorithm runs.
We’ll look at stacks, queues, and dictionaries.
A stack data structure models a familiar way of organizing objects in the physical world: stacked objects. When you add an object to the stack, it’s added to the top of the stack. When you take an object off of the stack, it’s taken off the top of the stack. In order to access the bottom plate of a stack of heavy plates, you might first have to take off all of the plates above it one by one, starting from the top. We say that the stack is “last in, first out” for this reason.
Here is a visualization of a stack of numbered bricks, written by Dartmouth student Daniel Shanker. Click on a brick to push it onto the stack in the correct order.
So, we need operations to push data items onto the top of the stack, and pop items off the top.
Sometimes, stacks and queues are referred to as abstract data types. In this case, abstract means that (just like in pseudocode) we don’t really care about some details of implementation. If a data structure allows items to be pushed and popped, following a “last in, first out” order, we say that it is a stack. If it quacks like a duck, it is a duck!
There are many ways to implement stacks and queues, and good reasons to choose particular implementations, such as the computational costs of push and pop operations, but we will focus on how stacks and queues can be used in algorithms. Fortunately, Python lists can be used as fine stacks, using the built-inappend
(to push data onto the
back of the list) and pop
methods.Sometimes, it’s nice to make a class describing a data structure. You should not be able to access the interior elements of a stack (that would probably indicate an error in your code) but if you are using a ‘bare’ Python list, you could do that without Python warning you. Here’s an implementation of a Stack class:
When you get in line (or enqueue) for an amusement park ride, you have your spot in line. Nobody can cut in front of you, and you are guaranteed to get out from the front of the line (dequeue) before anybody behind you. Like stacks, queues allow for the ordering of objects while following a certain set of rules. Whereas stacks are “last in, first out,” queues are the opposite - “first in, first out.”
We will need operations to enqueue data items at the end of the queue, and dequeue items from the front. Here is a visualization of a queue (written by Dartmouth student Daniel Shanker), with enqueue and dequeue operations. Try to mimic the following sequence (each character’s name is denoted by the letter next to them, and you can enqueue them by clicking on them): Dave and Eliza get in line to buy movie tickets, followed shortly by Alice. Dave is served. Before Eliza is served, Carol and Betty get in line. At that point there are only two tickets left, so only Alice and Carol are served.
You might notice that for the stack implementation above, the worst-case run-time of appending an item is Θ(n) (although it’s only Θ(1) amortized over many appends). If you dequeue an item a queue by deleting an item from the front of a list, the run-time is Θ(n) in all cases. When Python deletes an item from the front of a list, the interpreter loops over all remaining items, shifting them one index earlier in the list.
An alternate implementation would use a data structure called a doubly-linked list for either one, allowing Θ(1) operations to add or remove items from the beginning or end of the list. (Doubly-linked lists are described in a later chapter.) Perhaps that is how the Python deque class is implemented. I’d recommend using a deque if you need a queue or a stack.
Objective: write a class that implements a data structure by wrapping a simple data type.
There may be many different ways to implement a data type such as
Queue. We define the Queue based on the operations available on the
Queue, not based on how it is implemented: a Queue is a data type that
supports enqueue
and dequeue
operations. It’s
convenient to have a class that handles the details of the
implementation.
Write a Queue class that internally uses a list to keep the data in
order. Your Queue class should have methods to enqueue
an
item, dequeue
and return the item, empty
which
returns True
if there are no more items on the Queue, and
__str__
which returns a string representation of the
queue.
(Hint: none of the bodies of these methods needs to be more than one
line long.)
Here is a solution; solve the problem yourself before checking the solution.
A Python list allows each value to be accessed by an integer index. A dictionary allows each value to be accessed by an index, or key that might not be an integer – for example, a string.
We’ll consider an example problem where dictionaries are useful: document similarity using word frequency histograms. Before that, let’s look at how we might use an ordinary Python list to do something similar, using the indexing of a Python list in a clever way.
Objective: Implement an efficient algorithm that counts the frequency of numbers in a list.
You have a lot of digits stored in a list, digits
. Maybe
the digits are a secret code! How frequently does each digit appear?
Create a new list, frequency
with indices 0 through 9, and
implement a linear-time algorithm such that when run,
frequency[0]
contains the number of times that 0 appears in
digits
, frequency[1]
contains the number of
times that 1 appears, and so forth. (Hint. You should need to loop over
the digits
list exactly one time.)
Here is a solution; solve the problem yourself before checking the solution.
The frequency-counting trick is pretty nifty, but seriously, how often do you encounter a bunch of numbers and want to count how frequent each is, unless you work for a three-letter government agency?
Here’s similar motivating example problem, though, used by web search engines to find web pages.
The idea is, given a document, count how many times each word appears. Some words appear frequently in all documents, such as a, the, and, of. Let’s ignore those. But if the word blood appears many times, then the document might be about medical issues. It might also be a Halloween story. If the words patient and illness also appear many times as well, we expect it to be even more likely to be a medical document.
That’s why we might wish to count how many times each word appears. We could then presumably compare our frequency histograms for two documents to get a measure of how similar the two documents are.
We might like to use the same trick with list indexing to count word frequencies as we did to count digit frequencies, but unfortunately, we can’t index into a list with words the same way we can index into a list with digits. However, we can index into a Python data structure called a dictionary using a string as the index, or key, to find a corresponding value.
You create a dictionary using braces {
and
}
, whereas you create a list using brackets [
and ]
. Let’s create an empty dictionary first:
my_dictionary = {}
We can access items in the dictionary using square bracket notation,
just as we do for lists. For example, if we want to set the value of the
dictionary item with key "blue"
to be 460, we could do it
like this:
my_dictionary["blue"] = 460
Notice that a dictionary never gives a “list index out of range”
error. If the item associated with the key "blue"
was not
in the dictionary, a new entry in the dictionary is created by the above
line of code.
Just like a list, a dictionary can store any type of Python value:
my_dictionary["blue"] = "A color with wavelength of approximately 460 nm."
A formal description of a dictionary is that a dictionary is a data
stucture that stores relationship between key-value pairs. The
key is the item used to index into the dictionary
("blue"
, in our example), and the value is
the item stored in the dictionary (first, 460, and then
"A color with wavelength of approximately 460 nm."
).
We might imagine several ways that dictionaries might be implemented.
Most of the ways you might imagine first would work, but would have
expensive (in terms of time) methods to actually get the value for a
particular key. For example, you might imagine a dictionary internally
storing two lists, let’s say keys
and values
.
To find a value for a particular key, you could loop over
keys
to find the index j
of the right one (so
that the key you want is in keys[j]
), and then find the
value in values[j]
. This strategy is a form of linear
search, and so it has a worst-case time cost of Θ(n), where n is the number of items in the
dictionary. This is not how dictionaries are implemented in
Python. In fact, Python uses a clever implementation that allows
indexing into the dictionary to almost always take constant time, though
in the worst case, which occurs very rarely, it can take Θ(n) time.
Sometimes, it is useful to be able to loop over dictionary items. If
you use the for ... in ...
syntax, the items looped over
will be the keys of the dictionary. What if you want the values
too? Well, if you have a key, it is easy to get the value. So we can do
something like this:
Although here the output appeared in the order in which the items were placed into the dictionary, the order in which keys appear in a for-loop over a dictionary is undefined. All you know is that you’ll get all the keys in some order.
It can be useful to check whjether a particular key is in the
dictionary, since an error will be generated if you ask for the value
associated with a key that is not in the dictionary. You can use the
special in
operator to test; it gives back a boolean value,
like this:
(There’s also a not in
operator, which gives the
opposite of in
.)
By the way, Python lists have the same operator:
primes = [2, 3, 5, 7, 11]
print 5 in primes # prints True
print 6 in primes # prints False
Objective: Make use of a dictionary to count frequencies.
The following code loads the words of the Constitution of the United States into a list. Some of the words are interesting, in that they contribute to the uniqueness of the document, and others are not. For example, the word “the” appears frequently in the document, but is not interesting. Create a dictionary that contains all words of length greater than four as keys, and the frequency of each of those words as associated values.
Here is a solution; solve the problem yourself before checking the solution.
Here is the basic idea of how dictionaries work. When a dictionary is created, Python allocates space in memory to store the dictionary, and creates what is essentially a Python list in that space. The size of this list is called the capacity of the dictionary. When the dictionary goes to find the value associated with some key, the first thing that happens is that some numeric value is computed for the key; no matter what type the key is—a number, a string, an object reference, you name it—some numeric value is computed based on the key. That numeric value is used to index into the Python list associated with the dictionary. Since both the computation of the value and the Python list indexing take constant time, we can locate the value in constant time. (Usually. We’ll come back to this shortly, in the subsection below on “buckets and linked lists.”)
The first step of indexing into the dictionary by key is converting the key to some numeric value, called the hash code. To convert the key to a hash code, Python uses a hash function that takes the key as input, and somehow generates a number. Let’s look at a hash function that computes a number based on a key that is a string.
This function uses a couple of features of Python that are new to
you. First is the assert-statement. Here, we give the Python keyword
assert
, followed by a boolean expression. If the expression
evaluates to True
, we just go on to the next statement. But
if the expression evaluates to False
, then we get an
AssertionError. For example, my_hash_function
has an
assertion that checks whether the type of its parameter key
is a string. If it’s not, we gat an AssertionError like this:
Traceback (most recent call last):
File "/Users/thc/PycharmProjects/cs1proj/hash_code.py", line 15, in <module>
print my_hash_function(99)
File "/Users/thc/PycharmProjects/cs1proj/hash_code.py", line 2, in my_hash_function
assert type(key) == str # check if the key is a string
AssertionError
You have seen the built-in function type
, which returns
the type of its argument, and the built-in global variable
str
. But the other new feature of Python here is the
built-in function ord
, which takes a string of length 1 and
returns an integer value that is the ASCII code for the character in the
string. For example, because the ASCII code for a
is 97,
ord("a")
is 97. Similarly, ord("b")
is 98. We
add up all the ord
values of the individual characters in
the string to get some number.
Some hash functions are better than others. The one computed by
my_hash_function
is actually pretty bad. There are two
properties we would like a hash function to have:
The function we just wrote could work as a hash function, but it is bad from both perspectives. First, does the code distribute keys evenly throughout the table? No. Most strings will have multiple letters, and I would expect small indices in the table to be used infrequently. Most words are limited in length, however, and so very large indices in the table will not be used at all.
Second, does the function minimize collisions? No. Two words that are anagrams (you can rearrange the letters in one to get the other) will have the same hash code. For example, the following anagrams will have the same hash code: “opts”, “post”, “pots”, “spot”, “stop”, “tops”. And there are many other words that we expect to have the same hash value using this function, such as “pat” and “let”.
Python has a better hash function built in. You can call it using
hash
. For example:
The number is genearated by hash
is frequently large;
before using the value, the dictionary actually discards several bits
from the bit representation of the number in memory, to ensure that the
index is less than the total capacity of the hash table. We will not go
into the details of how the Python hash function works, but you are
welcome to look it up on the web if you are curious.
No matter how clever the hash function, sometimes two different keys hash to the same hash code value: a collision. What happens? It turns out that values are not always stored directly at indices of the table found from the hash function. Instead, if there are collisions, the slot in the hash table contains the address of a linked list. The linked list contains nodes. Each node contains an item that should be stored in this position in the table.
We’ll see linked lists soon, but in the worst case, the linked lists mean that the running time for accessing a key-value pair is Θ(n). It could happen that the hash function put many of the items in the same location in the table. Choosing a good hash function and ensuring that the capacity of the table is large enough makes this outcome extremely unlikely. Usually, the linked lists will be very short, leading to a constant time for indexing items.
If you were wondering where this term “hash” came from, it’s because a good hash function will “make hash” of the key, slicing and dicing it into a numerical representation.