Data structures

Table of Contents

This topic should be studied together with the standard algorithms used with each data structure, and their use in writing programs in actual programming languages.

Need for data structures

Usually a program needs to deal with not just one data value, but a whole set of them. Examples are

Each of these is a collection of data. We need a way to arrange all the data items, together somehow, in some structure, that we can handle in software. Different standard structures are available, suitable for different use cases. The pieces of data in the structure are called elements or nodes.

A static data structure is fixed in size. It cannot grow or shrink during execution. For example in a card game we might want a data structure for the card deck. There are always 52 cards in the deck, so a static data structure would be good.

A dynamic data structure can change in size during execution. In a card game the number of cards in a hand might change as cards are dealt and played. So a dynamic data structure would be needed here.

We will cover the following structures: arrays, files, queue, stack, graph, tree, hash table, dictionary and vector.

Key-value pairs

Usually each data item ‘thing’  in a data structure is a record. The record has a key field, and other data fields, referred to as values. The key field needs to be unique, different for every item.

For example, a data structure might contain data about students. A sample record might be:

101257

Roxanna

Begum

07748 861 001

15.1.2001

..

Student number

First name

Last name

Phone

Date of birth

More fields..

The key field is in yellow, and value fields in blue. Note we cannot use personal names as key fields, because we can have more than one student with the same name.

We usually use the data structure to find the value fields of a record with some key. For exampe we might need to know the phone number of student with ID 100753.

We will often just use a single data value – a number or a string – in our data structure examples. In the real things, these will be records.

Memory and file storage

We can hold data structures in memory or in files on a non-volatile medium such as disc. Memory is much faster, but file storage can be much larger.

Software cannot access file data directly. It must be read into memory, and it can be processed there by program instructions, and if needed, written back out to the file. A memory area used to hold data for this is usually called a buffer. We usually do not need to read an entire file into memory at the same time - just a part.  

Most of the data structures discussed here, like stacks and queues and trees, are in memory.

Fields, records and files

A file is a collection of data usually held on a non-volatile medium – one that keeps data when power is switched off.

Files are either ‘binary’ or ‘text’. In a text file, data is represented as characters. In a binary file, data is held some standard format. Examples of binary files are image files (like a jpg or png) or  a sound file like an mp3, or a video like an avi.

Both binary and text files are digital, so both are streams of bits.

Text files often consist of a sequence of records, each record having a set of fields. For example a file might contain data on customers. A customer record might have fields for customer ID, name, address, phone number, date of birth and email address.  The customer ID would be unique (each customer has a different ID), and would be the key field. The file might have thousands of records.

To use a file in code, we have to

open the file

read or write to it

close it

File access is actually done by the underlying operating system. This keeps track of different files by having a file handle or file channel for each open file. Programs in a high level language access files through the operating system facilities.

Each  file handle uses memory. If we open files but do not close them, we will eventually run out of available file handles, which are always limited.

Abstract data types

Image9

An abstract data type ( ADT ) is defined by what it can do. For example a stack is like a pile. You can put things on top of the pile (push a value) and take a value off the top (pop). Push and pop are the only two things a stack can do. This is what a stack is.

For example, if we push 5 values onto a stack, 9, 8, 4, 12 and 7, with 9 first and 7 last, we get the stack as shown. Then if we pop a value off the stack, we get 7 - the one most recently added.

That example has integers as the data type. Any data type could be used.

In actual programming, we need to implement an ADT – actually code it. For example we implement a stack ADT by writing code to carry out push and pop operations. We can do this in different ways, such as using by an array, or a linked list. These are different implementations of the stack ADT.  

In practice a language library often has implementations of standard ADTs built-in, and these are normally used in real code. But it is a useful learning task to write our own versions to see how they can work..

Arrays

An array is a basic type of data structure. An array is a set of storage locations or boxes, with each box having a number, called the index. We can read or write a box using its index.

In many languages indexes are enclosed in [square brackets], and the numbering starts at 0 not 1.  Arrays are often fixed in length and cannot grow or shrink during execution.

Usually all the elements of an array have the same type.

For example

int myData[100]; // declare myData to be an array of 100 // ints

myData[0] = 23; // store 23 in the first location

myData[9] = 17; // store 17 in the 10th location

y = myData[7]; // read contents of index 7 and store in y

 

An array is an ADT - accessed through an index.

But in many laguages an array is a built-in data structure, implemented as a single block of memory, with each element next to each other. This makes it simple for the runtime to know where to find where an element is. For example, suppose an array starts at address 5000, and has 100 ints, each 4 bytes long. The array will occupy 400 bytes, from 5000 to 5400. Index 0 is at 5000. Index 10 is at address 5000 + 10 X 4 = 5040

If the elements in an array are themselves arrays, we have two dimensional arrays.

A two dimensional array might be used to hold a matrix.

Beyond two, we can have multi-dimensional arrays.

Queues

A queue is like cars queueing at traffic lights. The first one to leave will be at the head of the queue - the first one to join the queue.

A queue is a FIFO list – first in, first out.

If we add 4,6,2 and 3 to an empty queue, in sequence, then remove items, we get 4,6,2 and 3 out. First in, first out.

A queue might be used for a keyboard input buffer. The user wants keystrokes to be processed in the order that they are typed.  

A queue is an ADT. It might be implemented by using an array or a linked list.

In a priority queue, elements have a data value, and also a priority. Items with a higher priority come out of the queue first.

A circular queue is a type of queue implementation. A block of memory is used to store the queue. This might be an array. There are two pointers into the array, head and tail. Data added to the queue goes in at the head, which is then incremented. Data is removed from the tail, which is also incremented.  So in use, the queue moves forward through the array. Eventually the head will hit the array end – when it wraps around back to the array start. Similary for the tail. It might be that the head catches up with the tail, in which case the queue is full.

Circular queues are also called ring buffers. A keyboard input buffer might be a circular queue. Keystrokes go into the queue as keys are hit, and removed as the application ‘uses’ them.   

Stacks

A stack is a LIFO structure – last in first out.

A stack is an ADT which has two operations - push a new item onto a stack, and pop a value off.

So if we push 1,2,3,4 onto a stack, then pop values of, we get 4,3,2,1.

Stacks are used in connection with subroutine calls, forpassing parameters, return values and addresses.

The ADT could be implemented in an array or a linked list.

Graphs

This is not the same as an x-y graph in mathematics. That is a Cartesian graph usually showing some function of x sideways and y upwards. These graphs are different.

A graph might be used to represent roads connecting towns on Google maps. Or computers and printers on a network. Or underground stations on the Paris Metro. Or shipping routes between ports.

For example:

Image1

Graphs are not images. The diagram is a diagrammatic representation of a graph (like a Tube map is not a picture of the Tube network).

The graph has some black circles linked by lines. The circles, labelled A to H, are called nodes or vertexes (singular : vertex ). The lines are called edges or arcs.  These edges have arrows – they are one way links. These are called directed edges, and we have a directed graph. In some graphs the edges do not have arrows, and the graph is undirected.

The lines have numbers. These might be distance in miles, or travel times in minutes, or cost of a plane ticket, or the bandwidth on that part of a network. Whatever the number means, it is the edge weight. This is a weighted graph. Some graphs are not weighted.

Nodes D E and H form a loop – called a cycle. A graph with no cycles is called an acyclic graph.

A B and C are linked, but nothing links these with D to H. This graph is not connected  like road maps on two separate islands. In a connected graph there is a link through intermediate nodes between any two nodes.

Image2

How can we represent a graph in a computer? We want to be able answer questions like ‘how far from x to y’ or ‘what is the fastest route from a to d going through p’? An actual image would not be  much help.

One technique is to use an adjacency matrix. This is a square table, with a row and column for each edge, and the cell showing if the two edges are linked (and maybe the weighting if they are).

For example, in this graph, the adjacency matrix would be

 

A

B

C

D

A

 

1

1

 

B

 

 

1

 

C

 

 

 

1

D

 

 

 

 

The empty cells actually contain 0. The cell in row B column C is 1 because there is an edge B to C. If the edges are weighted, the cell can contain the weight. As a matrix, this might be held in a 2D array in memory.

An alternative is an adjacency list. This is a list of nodes, and for each, a list of nodes linked to it. So for the same graph:

A

B, C

B

C

C

D

D

 

 

With the matrix we can quickly find out if any two nodes have an edge between. But the list uses less memory, especially for a graph with a small number of edges.

Trees

A tree is a type of graph – a connected undirected acyclic graph. Here is a tree:

Image3

Image4

That is not very typical. This is more usual:

This is a rooted tree. There are parent-child relationship between nodes – for example C is the parent of nodes F and G. One node (and only one) has no parent (in this example, node A). This is the root node of the tree – where it ‘starts’.

This is a binary tree, because each node has at most two child nodes (A has 2, F has 1, K has 0). Trees do not have to be binary.

Uses for rooted trees

Image5

One use is the DOM  the document object model for representing the structure of a web page. A simple web page might have two parts – a head and a body. The head might contain the title.Inside the body we might have two divs. The first div might contain two paragraphs, and the second div one paragraph. Here is the tree:

 

Image6

In Javascript code we can search this tree, add and change and remove nodes, and so alter the web page dynamically

 

Another use is an ‘abstract syntax tree’ to represent some program code. For example suppose some program contains the expression

(x+3)*(4*y-7). As a tree this is as shown. An interpreter or compiler will process source code and generate trees like this to represent valid expressions.

 

 

Image7

Another use is a binary search tree (BST). This is a binary tree (maximum 2 child nodes) which is ordered. That is, smaller data goes left and larger goes right.  Here is a BST for some names, ordered alphabetically.

In a real application this might be a store of personnel data. Each node would hold a record on an employee, with fields like name, address, department, pay grade and so on. Here we just show first names.

Suppose we search this for Joe. We start at the root. Joe comes after George, so we go right.

Mike comes after Joe, so we go left.

This node matches Joe and he is found. We can read his paygrade or whatever we need.

Now search for Barry. Start at the root. Barry comes before George, so go left. Barry is before Dave, so go left. Barry is before Brian.

But there is no node to the left of Brian – the left pointer from Brian is null. So, Barry is not in the tree.

Why use a BST? Because it is fast to search. Each comparison (an ‘if’ ) takes us down one level. A two level tree has up to 2 nodes in the lowest level. A three level tree has 2 X 2 = 4. A  four level tree has up to 8. A ten level tree has 512, and we can search it with just 9 comparisons. This is much faster than a list in an array, for example. On average we would find it half way through, after 256 ifs.

Hash tables

A hash table is a type of data structure, containing key-value pairs. It is one type of map, mapping keys to values.

We use a hash table if we want very fast retrieval. That is to find and fetch keys very quickly.

This works by calculating where they are. We do not search for keys. We calculate where they are.

The calculation is called the hash function.

Data is held in a set of storage locations, called buckets. A bucket can hold one record.

A very small simplified example is shown. We have ten buckets. The data has a key field which is a 3 digit integer, from 000 to 999. Th ehashing function is simply to take the first digit of the key field. So key 197 hashes to bucket 1. Key 865 hashes to bucket 8:

Image8

The algorithm to obtain a record, given a key (say 523) is very simple.

  1. Apply the hashing function to get the bucket (5)

  2. Look in that bucket. If its empty, the data is not there. If its not empty – the record has been found

But – suppose we then insert a record with key 511. This also hashes to bucket 5. This is a collision – two key fields hash to the same bucket.

It is usually impossible to find a hashing function which will avoid collisions. They must be handled somehow.

One solution is to place overflow records into linked lists starting at the bucket hashed to, like this:

Image10

Then to find key 272, for example, we

hash it – get 2

look in bucket 2

loop : if it matches, its found

else move to next node in the list

until end of list – not present

This works. But searching a linked list is slow. On average we find it half way along, and this might be a long list.

Another technique is to store it in the next available bucket. So to store 291, if bucket 2 is occupied, store it in bucket 3.  Again retrieval involves a slow linear search.

The fraction of buckets occupied is called the load factor. So if half the buckets re occupied, the load factor is 0.5.

The higher the load factor is, the more likely a collision is, and the hash table will become slow.

We can fix this by re-hashing. This means:

Creating a new, larger number, of empty buckets.

Getting a new hashing function. This is ften done using modulus (remainder) to get the hash into the correct rangle. For example if we have 1000 buckets, we have a hash mod 1000. This will be from 0 to 999 and will fit.

Move existing data into the new buckets.

Return existing buckets to free memory (garbage collect)

Since we now have more buckets, the load factor is lower, and the table faster. But doing the re-hashing takes time and memory.      

Dictionaries

A dictionary is another type of map implementation – a set of key-value pairs. It is sometimes called an associative array. In an ordinary array, we access elements through their index. In a dictionary, we get a value by using its key. Dictionaries usually have just two operations – to put a key-value pair into the dictionary, an dget a vlaue for a given key.

For example, this is a dictionary where the key is a character and the value is an integer. Key s gets value 3. Key h gets value 1.

key

value

t

3

h

1

i

1

s

3

a

1

e

1

 

Suppose, for example, we need to count how many occurences of each letter there is in a text – which we might do, for example, in cryptography.

The algorithm would be:

for each character in the text..

if it not in the dictionary, add it, with value 1

else (it is in) add 1 to its value

If we do this on

this is a test

we get the dictionary shown above.

Vectors

4.2.8.1 Vectors

 

Be familiar with the concept of a vector and the

following notations for specifying a vector: A vector can be represented as a list of

numbers, as a function and as a way of

representing a geometric point in space.

• [2.0, 3.14159, -1.0, 2.718281828]

• 4-vector over R written as R 4

• function interpretation

• 0 ↦ 2.0

• 1 ↦ 3.14159

• 2 ↦ -1.0

• 3 ↦ 2.718281828

• ↦ means maps to

A dictionary is a useful way of representing a

vector if a vector is viewed as a function.

f : S → R

the set S = {0,1,2,3} and the co-domain, R, the

set of Reals

For example, in Python the 4-vector example

could be represented as a dictionary as follows:

That all the entries must be drawn from the

same field, eg R. {0:2.0, 1:3.14159, 2:-1.0, 3:2.718281828}

Dictionary representation of a vector. See above.

List representation of a vector. For example, in Python, a 2-vector over R

would be written as [2.0,3.0].

1-D array representation of a vector. For example in VB.Net, a 4-vector over R would

be written as Dim example(3) As Single.

Visualising a vector as an arrow. For example a 2-vector [2.0, 3.0] over R can be

represented by an arrow with its tail at the origin

and its head at (2.0, 3.0).

Vector addition and scalar-vector multiplication. Know that vector addition achieves translation

and scalar-vector multiplication achieves

scaling.

Convex combination of two vectors, u and v. Is an expression of the form αu + βv where α, β

≥ 0 and α + β = 1

Dot or scalar product of two vectors. The dot product of two vectors, u and v,

u = [u 1 , ...., u n ] and v = [v 1 , ....., v n ] is

u ∙ v = u 1 v 1 + u 2 v 2 + ...... + u n v n

Applications of dot product.

Finding the angle between two vectors.