Sometimes when you solve a problem, it’s helpful to take some notes as you go: book-keeeping. For example, if you had to find a path through a picture of a maze, you might use your pencil to draw some paths as you worked out the problem. If you were solving a Sudoku puzzle, you would mark down some possible numbers in each cell.
A data structure allows information to be organized and stored. Sometimes data structures are used to organize important data (for example, a collection of names and associated phone numbers), and sometimes data structures are used for book-keeping as an algorithm runs.
We’ll look at stacks, queues, and dictionaries.
A stack data structure models a familiar way of organizing objects in the physical world: stacked objects. When you add an object to the stack, it’s added to the top of the stack. When you take an object off of the stack, it’s taken off the top of the stack. In order to access the bottom plate of a stack of heavy plates, you might first have to take off all of the plates above it one by one, starting from the top. We say that the stack is “last in, first out” for this reason.
Here is a visualization of a stack of numbered bricks, written by Dartmouth student Daniel Shanker. Click on a brick to push it onto the stack in the correct order.
So, we need operations to push data items onto the top of the stack, and pop items off the top.
Sometimes, stacks and queues are referred to as abstract data types. In this case, abstract means that (just like in pseudocode) we don’t really care about some details of implementation. If a data structure allows items to be pushed and popped, following a “last in, first out” order, we say that it is a stack. If it quacks like a duck, it is a duck!
There are many ways to implement stacks and queues, and good reasons to choose particular implementations, such as the computational costs of push and pop operations, but we will focus on how stacks and queues can be used in algorithms. Fortunately, Python lists can be used as fine stacks, using the built-inappend
(to push data onto the back of the list) and pop
methods.Sometimes, it’s nice to make a class describing a data structure. You should not be able to access the interior elements of a stack (that would probably indicate an error in your code) but if you are using a ‘bare’ Python list, you could do that without Python warning you. Here’s an implementation of a Stack class:
When you get in line (or enqueue) for an amusement park ride, you have your spot in line. Nobody can cut in front of you, and you are guaranteed to get out from the front of the line (dequeue) before anybody behind you. Like stacks, queues allow for the ordering of objects while following a certain set of rules. Whereas stacks are “last in, first out,” queues are the opposite - “first in, first out.”
We will need operations to enqueue data items at the end of the queue, and dequeue items from the front. Here is a visualization of a queue (written by Dartmouth student Daniel Shanker), with enqueue and dequeue operations. Try to mimic the following sequence (each character’s name is denoted by the letter next to them, and you can enqueue them by clicking on them): Dave and Eliza get in line to buy movie tickets, followed shortly by Alice. Dave is served. Before Eliza is served, Carol and Betty get in line. At that point there are only two tickets left, so only Alice and Carol are served.
You might notice that for the stack implementation above, the worst-case run-time of appending an item is Θ(n) (although it’s only Θ(1) amortized over many appends). If you dequeue an item a queue by deleting an item from the front of a list, the run-time is Θ(n) in all cases. When Python deletes an item from the front of a list, the interpreter loops over all remaining items, shifting them one index earlier in the list.
An alternate implementation would use a data structure called a doubly-linked list for either one, allowing Θ(1) operations to add or remove items from the beginning or end of the list. (Doubly-linked lists are described in a later chapter.) Perhaps that is how the Python deque class is implemented. I’d recommend using a deque if you need a queue or a stack.
Objective: write a class that implements a data structure by wrapping a simple data type.
There may be many different ways to implement a data type such as Queue. We define the Queue based on the operations available on the Queue, not based on how it is implemented: a Queue is a data type that supports enqueue
and dequeue
operations. It’s convenient to have a class that handles the details of the implementation.
Write a Queue class that internally uses a list to keep the data in order. Your Queue class should have methods to enqueue
an item, dequeue
and return the item, empty
which returns True
if there are no more items on the Queue, and __str__
which returns a string representation of the queue.
(Hint: none of the bodies of these methods needs to be more than one line long.)
Here is a solution; solve the problem yourself before checking the solution.
A Python list allows each value to be accessed by an integer index. A dictionary allows each value to be accessed by an index, or key that might not be an integer – for example, a string.
We’ll consider an example problem where dictionaries are useful: document similarity using word frequency histograms. Before that, let’s look at how we might use an ordinary Python list to do something similar, using the indexing of a Python list in a clever way.
Objective: Implement an efficient algorithm that counts the frequency of numbers in a list.
You have a lot of digits stored in a list, digits
. Maybe the digits are a secret code! How frequently does each digit appear? Create a new list, frequency
with indices 0 through 9, and implement a linear-time algorithm such that when run, frequency[0]
contains the number of times that 0 appears in digits
, frequency[1]
contains the number of times that 1 appears, and so forth. (Hint. You should need to loop over the digits
list exactly one time.)
Here is a solution; solve the problem yourself before checking the solution.
The frequency-counting trick is pretty nifty, but seriously, how often do you encounter a bunch of numbers and want to count how frequent each is, unless you work for a three-letter government agency?
Here’s similar motivating example problem, though, used by web search engines to find web pages.
The idea is, given a document, count how many times each word appears. Some words appear frequently in all documents, such as a, the, and, of. Let’s ignore those. But if the word blood appears many times, then the document might be about medical issues. It might also be a Halloween story. If the words patient and illness also appear many times as well, we expect it to be even more likely to be a medical document.
That’s why we might wish to count how many times each word appears. We could then presumably compare our frequency histograms for two documents to get a measure of how similar the two documents are.
We might like to use the same trick with list indexing to count word frequencies as we did to count digit frequencies, but unfortunately, we can’t index into a list with words the same way we can index into a list with digits. However, we can index into a Python data structure called a dictionary using a string as the index, or key, to find a corresponding value.
You create a dictionary using braces {
and }
, whereas you create a list using brackets [
and ]
. Let’s create an empty dictionary first:
my_dictionary = {}
We can access items in the dictionary using square bracket notation, just as we do for lists. For example, if we want to set the value of the dictionary item with key "blue"
to be 460, we could do it like this:
my_dictionary["blue"] = 460
Notice that a dictionary never gives a “list index out of range” error. If the item associated with the key "blue"
was not in the dictionary, a new entry in the dictionary is created by the above line of code.
Just like a list, a dictionary can store any type of Python value:
my_dictionary["blue"] = "A color with wavelength of approximately 460 nm."
A formal description of a dictionary is that a dictionary is a data stucture that stores relationship between key-value pairs. The key is the item used to index into the dictionary ("blue"
, in our example), and the value is the item stored in the dictionary (first, 460, and then "A color with wavelength of approximately 460 nm."
).
We might imagine several ways that dictionaries might be implemented. Most of the ways you might imagine first would work, but would have expensive (in terms of time) methods to actually get the value for a particular key. For example, you might imagine a dictionary internally storing two lists, let’s say keys
and values
. To find a value for a particular key, you could loop over keys
to find the index j
of the right one (so that the key you want is in keys[j]
), and then find the value in values[j]
. This strategy is a form of linear search, and so it has a worst-case time cost of Θ(n), where n is the number of items in the dictionary. This is not how dictionaries are implemented in Python. In fact, Python uses a clever implementation that allows indexing into the dictionary to almost always take constant time, though in the worst case, which occurs very rarely, it can take Θ(n) time.
Sometimes, it is useful to be able to loop over dictionary items. If you use the for ... in ...
syntax, the items looped over will be the keys of the dictionary. What if you want the values too? Well, if you have a key, it is easy to get the value. So we can do something like this:
Although here the output appeared in the order in which the items were placed into the dictionary, the order in which keys appear in a for-loop over a dictionary is undefined. All you know is that you’ll get all the keys in some order.
It can be useful to check whjether a particular key is in the dictionary, since an error will be generated if you ask for the value associated with a key that is not in the dictionary. You can use the special in
operator to test; it gives back a boolean value, like this:
(There’s also a not in
operator, which gives the opposite of in
.)
By the way, Python lists have the same operator:
primes = [2, 3, 5, 7, 11]
print 5 in primes # prints True
print 6 in primes # prints False
Objective: Make use of a dictionary to count frequencies.
The following code loads the words of the Constitution of the United States into a list. Some of the words are interesting, in that they contribute to the uniqueness of the document, and others are not. For example, the word “the” appears frequently in the document, but is not interesting. Create a dictionary that contains all words of length greater than four as keys, and the frequency of each of those words as associated values.
Here is a solution; solve the problem yourself before checking the solution.
Here is the basic idea of how dictionaries work. When a dictionary is created, Python allocates space in memory to store the dictionary, and creates what is essentially a Python list in that space. The size of this list is called the capacity of the dictionary. When the dictionary goes to find the value associated with some key, the first thing that happens is that some numeric value is computed for the key; no matter what type the key is—a number, a string, an object reference, you name it—some numeric value is computed based on the key. That numeric value is used to index into the Python list associated with the dictionary. Since both the computation of the value and the Python list indexing take constant time, we can locate the value in constant time. (Usually. We’ll come back to this shortly, in the subsection below on “buckets and linked lists.”)
The first step of indexing into the dictionary by key is converting the key to some numeric value, called the hash code. To convert the key to a hash code, Python uses a hash function that takes the key as input, and somehow generates a number. Let’s look at a hash function that computes a number based on a key that is a string.
This function uses a couple of features of Python that are new to you. First is the assert-statement. Here, we give the Python keyword assert
, followed by a boolean expression. If the expression evaluates to True
, we just go on to the next statement. But if the expression evaluates to False
, then we get an AssertionError. For example, my_hash_function
has an assertion that checks whether the type of its parameter key
is a string. If it’s not, we gat an AssertionError like this:
Traceback (most recent call last):
File "/Users/thc/PycharmProjects/cs1proj/hash_code.py", line 15, in <module>
print my_hash_function(99)
File "/Users/thc/PycharmProjects/cs1proj/hash_code.py", line 2, in my_hash_function
assert type(key) == str # check if the key is a string
AssertionError
You have seen the built-in function type
, which returns the type of its argument, and the built-in global variable str
. But the other new feature of Python here is the built-in function ord
, which takes a string of length 1 and returns an integer value that is the ASCII code for the character in the string. For example, because the ASCII code for a
is 97, ord("a")
is 97. Similarly, ord("b")
is 98. We add up all the ord
values of the individual characters in the string to get some number.
Some hash functions are better than others. The one computed by my_hash_function
is actually pretty bad. There are two properties we would like a hash function to have:
The function we just wrote could work as a hash function, but it is bad from both perspectives. First, does the code distribute keys evenly throughout the table? No. Most strings will have multiple letters, and I would expect small indices in the table to be used infrequently. Most words are limited in length, however, and so very large indices in the table will not be used at all.
Second, does the function minimize collisions? No. Two words that are anagrams (you can rearrange the letters in one to get the other) will have the same hash code. For example, the following anagrams will have the same hash code: “opts”, “post”, “pots”, “spot”, “stop”, “tops”. And there are many other words that we expect to have the same hash value using this function, such as “pat” and “let”.
Python has a better hash function built in. You can call it using hash
. For example:
The number is genearated by hash
is frequently large; before using the value, the dictionary actually discards several bits from the bit representation of the number in memory, to ensure that the index is less than the total capacity of the hash table. We will not go into the details of how the Python hash function works, but you are welcome to look it up on the web if you are curious.
No matter how clever the hash function, sometimes two different keys hash to the same hash code value: a collision. What happens? It turns out that values are not always stored directly at indices of the table found from the hash function. Instead, if there are collisions, the slot in the hash table contains the address of a linked list. The linked list contains nodes. Each node contains an item that should be stored in this position in the table.
We’ll see linked lists soon, but in the worst case, the linked lists mean that the running time for accessing a key-value pair is Θ(n). It could happen that the hash function put many of the items in the same location in the table. Choosing a good hash function and ensuring that the capacity of the table is large enough makes this outcome extremely unlikely. Usually, the linked lists will be very short, leading to a constant time for indexing items.
If you were wondering where this term “hash” came from, it’s because a good hash function will “make hash” of the key, slicing and dicing it into a numerical representation.