updated: 2022-05-05_11:08:04-04:00
Hashing Implementations
- Balanced binary search tree
- effective but not instantaneous
- Hashing gives us direct access O(c) by key value
- h(K) = i in the array
- h is a hash function
- array holds values is called Hash Table HT
- We want a big table, small amount of keys
Mid Square Method (square number and take middle r bits)
- We are trying to minimize the number of collisions
Folding: equal weight to all characters, folding all chars together
- Open Hashing
- store collisions outside the HT
- Each slot is head of a linked list
- Closed Hashing
- store collisions at another slot in the HT
- All records are in the HT
- $h(k_{R})$: home position for record R with key $k_{R}$
- Collision Resolution Policy - for deciding where to store R (should the slot be occupied)
- Same policy must be followed to retrieve record R
Bucket Hashing
- closed
- divided into M/B
- (Prof. did not like bucket hashing)
Probing
-
closed
-
Collision Resolution Strategy must provide a probe sequence
-
p(K,i) - the probe function that provides this sequence
- K: key
- i: probe i
Linear Probing
p(K,i) = i
-
good: all slots are candidates
-
bad: not ever slot is equally likely to be chosen
-
if slot is already occupied, probability is 0 for next filled
-
Probability of slot 6 is 1/|HT| + each contiguous filled slot
Problem: Primary Clustering
- This is caused by linear probing causing the records to clump
- Possible fix: cycle through all slots before returning to home
- we multiply the value by a factor called c
- p(k,i)=ci
- for this to work, c must be relatively prime to M (Table size)
Pseudo-random probing
-
p(k,i) = Perm[i-1]
-
EG: M = 101, Perm[1] = 5, Perm[2] = 2, Perm[3] = 32
- p(k$_{1}$)=30,35,32,62
- p(k$_{2}$)=35,40,37,67
- This didn't make sense
Better Pseudorandom Example
| Table | = 10
Perm = [2,6,5,1,7,9,3,8,4]
We use permutation for the offset
Slot 0 is the home slot
keys = 2, 22,42,32,12
p(k,i) = Perm[i-1]
first is 2, 2 mod 10 is 2
22? (home is 2)
42 mod 10 = 2.
2 is occupied
Collision resolution mode
home is 2, i = 1, first one in sequence
P(k,i) =
32 mod 10 = 2
start process over, i = 1
p[0]=2
2 + home = 4, 4 mod 10 is 4, do it again
i = 2
home is still 2
p(k,i)=p[1]=6
6+2 = 8, 8 is full,
i = 3
home is still 2
p[2] = 5
2+5 = 7
7 mod 10 is 7, 7 is open
Slot | Value Stored | Prob. Next | Probe Seg. |
---|---|---|---|
0 | 1/10 + 1/10 | ||
1 | 1/10 | ||
2 | 2 | 0 so sending to 9 | 4,8,7,3,9 |
3 | 12 | 0 so sending to 5 | 5,9,8,4,0 |
4 | 22 | 0 so sending to 6 | 6,0,9,5,1 |
5 | 1/10 + 1/10 | ||
6 | 1/10 + 1/10 | ||
7 | 32 | 0 so sending to 9 | 9,3,2,4,6 |
8 | 42 | 0 so sending to 0 | 0,4,3,9,5 |
9 | 1/10 + 1/10 + 1/10 |
Quadratic Hashing
Basically, add i^2 to the home slot
Collision Resolution
- Primary clustering is a problem
- pseudo/quadradic help, but clustering still occurs
- Secondary Clustering
- when two keys are assigned to the same home slot
Double Hashing
-
Helps solve secondary clustering
-
A return to linear probing where the constant is determined using the key K
-
p(K,i) = i * h_2 (K)
Example:
h(K) = K % M
= K % 7
p(K,i) = i * h${2}$(K)
h${2}$(K) = REV (K) % 5
For this second hash function we chose reverse mod 5. We have to use the key somehow for double hashing
Keys: 2, 15, 14, 23, 81
2 - 2 mod 7 is 2
First collsion is 23:
- H(K) = 23%7 = 2; collision
$H_{2}$(K) = REV(23)%5 = 32%5 = 2
so, p(K,i) = i * 2 = {2,4,6, ... } - H(K) = 81%7 = 4; collision
$H_{2}$(K) = REV(81)%5 = 18%5 = 3
P(K,i) = i * $H_{2}$(K) = {3,6,9, ... }
Base slot = 4, we add the P(K,i) to it
4+3%7=0, collision
4+6%7=3, N.C.
Slot | Value Stored | Explanation |
---|---|---|
0 | 14 | h(K) = 14%7 = 0; N.C. |
1 | 15 | h(K) = 15%7 = 1; N.C. |
2 | 2 | h(K) = 2%7 = 2; no collision |
3 | 81 | See number 2 |
4 | 23 | See number 1 |
| 5 | | |
| 6 | | |
Performance Analysis
- a fast method of storing and retrieving keys
- All searches require constant time
- How full the hash table is is the best metric of performance
Size considerations
Load Factor: n/k where:
n = number of entries
k = number of buckets
As load factor increases, hash table performance slows
- In general, 70 - 75 % fullness is best
Deletion
- No duplicate records stored in the table
- deletion cannot undermine later searches
- solution is tombstone
- Hash table will fill up the more tombstones we have