, 6 min read
Hashing with Iterator in C
Many scripting languages, like Awk, Perl, Python, Lua, and so on, have hash-tables already built-in. I needed hashing functionality similar to Perl's hash-variables. In particular I needed lookup, insert, and iterating over all entries.
I wrote on hashing in Hashing Just Random Numbers and Hash functions: An empirical comparison — article by Peter Kankowski. See also Revisiting hash table performance, or TommyDS: a C library of hashtables and tries.
1. Structure. Each hash element contains a key and a value. Additionally there is an indicator whether storage allocation using malloc()
for key and value are done during hash insert, or are done elsewhere, e.g., constant strings. All hash elements having the same hash value, see below, are inserted into a singly linked list. Additionally all elements are part of singly linked list.
struct hashElem {
int dup; // memory ownership: 0, 0x01, 0x02, 0x03
char *key, *val;
struct hashElem *next; // next hash element with same key or NULL
struct hashElem *nextAll; // for iterator over all hash entries
};
There is a hash header which contains the start of the list, and the actual table.
typedef struct {
int size, collisions, load;
const char *name; // only for diagnostic and debugging
struct hashElem **tab; // hash table with size times (hashElem*)
struct hashElem *first; // first element in hash
} hash_t;
Variable `load` is the number of entries in the hash table. So
if (zzr01Hash->load > 0) ...
is equivalent to the following Perl code
if (keys %zzr01Hash > 0) ...
The C program uses 7 hash tables. Each of these is initialized by
// Initialize hash with given size, return NULL on error
hash_t *hashInit (int size, const char *name) {
hash_t *h = (hash_t*)malloc(sizeof(hash_t));
if (h == NULL) return NULL;
// allocate size x pointer to hashElem, calloc clears memory
h->tab = (struct hashElem**)calloc(sizeof(struct hashElem*),size);
if (h->tab == NULL) return NULL;
h->size = size;
h->collisions = 0;
h->load = 0;
h->name = name;
h->first = NULL;
return h;
}
So the C code
pgmHash = hashInit(ZZ_HASHSIZE,"pgmHash");
greenHash = hashInit(GREEN_HASHSIZE,"greenHash");
zzHash = hashInit(ZZ_HASHSIZE,"zzHash");
decommHash = hashInit(CORIA_HASHSIZE,"decommHash");
dynHash = hashInit(CALL_HASHSIZE,"dynHash");
callHash = hashInit(CALL_HASHSIZE,"callHash");
zzr01Hash = hashInit(CALL_HASHSIZE,"zzr01Hash");
is equivalent to the following Perl code
my (%pgmHash, %greenHash, %zzHash, %decommHash, %dynHash, %callHash, %zzr01Hash);
Clearly, the Perl version is much more terse and easier to write.
2. Lookup. For lookup I used a slight variation of Kernighan & Ritchie's hash function given in their seminal work The C Programming Language. Instead of adding up the characters I used exclusive-or instead. Initial value is 5381 from Dan Bernstein, instead of zero.
// Hashing function according Kernighan & Ritchie,
// The C Programming Language, chapter 6.6
unsigned int hashKR (hash_t *h, const char *s) {
unsigned int hv = 5381;
for (; *s; ++s) {
hv = 31 * hv ^ *s;
}
return hv % h->size;
}
Lookup is now
// Find key in hash
const char *hashLookup (hash_t *h, const char *key) {
unsigned int hv = hashKR(h,key);
struct hashElem *e = h->tab[hv];
while (e) {
if (strcmp(e->key,key) == 0) return e->val;
e = e->next;
}
return NULL;
}
So the C code
if (hashLookup(pgmHash,token)) ...
is equivalent to the following Perl code
if (defined($pgmHash{$token}) ...
3. Insertion. Inserting new key/value pair is
// Insert key into hash h
// Use strdup() for key/val if dup nonzero
// possible values for dup: 0x00 no strdup(), 0x01: key ony, 0x02: val only, 0x03: key+val
int hashInsert (hash_t *h, const char *key, const char *val, int dup) {
unsigned int hv = hashKR(h,key);
struct hashElem *e = h->tab[hv];
while (e) {
assert(e->key);
if (strcmp(e->key,key) == 0) {
e->dup = dup; // overwrite dup
// check if exact same key/val pair already present
if (strcmp(e->val,val) == 0) return 0;
if (dup & 0x02) free(e->val); // overwrite val
return (e->val = dup & 0x02 ? strdup(val) : (char*)val) == NULL;
}
e = e->next;
h->collisions += 1;
}
// key not found, i.e., e==NULL || collisions > 0
h->load += 1; // new element
e = (struct hashElem*)malloc(sizeof(struct hashElem));
if (e == NULL) return 2;
e->dup = dup;
e->key = dup & 0x01 ? strdup(key) : (char*)key;
e->val = dup & 0x02 ? strdup(val) : (char*)val;
if (e->key == NULL || e->val == NULL) return 3;
// elements with same keys are all members of a singly linked list
e->next = h->tab[hv]; // prepend one-self at head of list
h->tab[hv] = e;
// maintain 2nd singly linked list for iterator over all hash elements
e->nextAll = h->first;
h->first = e;
return 0; // all is well
}
Depending on the dup
licate variable it either calls strdup()
oder just copies the pointer. Function strdup()
calls malloc()
.
The C code
hashInsert(dynHash,token[1],token[j],3);
is equivalent to the Perl code
$dynHash{$token1} = $tokenJ;
Three hash tables are re-used over and over again. So there must be a function to clear the hash table.
// Free all key/value pairs in hash, but hash-table memory h->tab[] is not freed.
// Hash table can then be reused for new key/value pairs.
void hashClear (hash_t *h) {
int i;
struct hashElem *e, *sv;
for (i=0; i < h->size; ++i) {
for (e=h->tab[i]; e; ) {
if (e->dup & 0x01) free(e->key);
if (e->dup & 0x02) free(e->val);
e->nextAll = NULL;
sv = e; // save this e
e = e->next;
free(sv);
}
h->tab[i] = NULL; // clear slot in tab[]
}
h->collisions = 0;
h->load = 0;
h->first = NULL;
}
So the C code
hashClear(dynHash);
hashClear(callHash);
hashClear(zzr01Hash);
is equivalent to the following Perl code
(%dynHash, %callHash, %zzr01Hash) = ((), (), ());
4. Statistics. For debugging and informational purposes it is advantageous to have information on the distribution of key/value pairs in our hash-tables.
Function hashStat()
populates an array with the distribution of key/value pairs. Above a certain threshold, given in distrMax
, all collisions are grouped collectively.
// Statistics on hash table
// distr[0] = number of empty slots
// distr[1] = number of entries with exactly one element
// distr[2] = number of entries with exactly two elements in linked list
// and so on
// last distr[distrMax] lumps all the rest
void hashStat (hash_t *h, int distr[], int distrMax) {
int i, cnt;
struct hashElem *e;
--distrMax;
if (distrMax < 0) return;
for (i=0; i<=distrMax; ++i) distr[i] = 0; // clear statistics
for (i=0; i < h->size; ++i) {
cnt = 0;
for (e=h->tab[i]; e; e=e->next)
++cnt;
distr[cnt<=distrMax?cnt:distrMax] += 1;
}
}
Printing this distribution is now:
void hashStatPrt (hash_t *h) {
enum { DISTRMAX = 10 };
int i, distr[DISTRMAX];
hashStat(h,distr,DISTRMAX);
printf("\t%s: size=%d, load=%d, load/size=%7.4f, collisions=%d\n",
h->name, h->size, h->load, h->load / (double)(h->size), h->collisions);
for (i=0; i<DISTRMAX; ++i)
printf("\t\tdistr[%2d]=%d\n",i,distr[i]);
}
Example output for distribution using
hashStatPrt(pgmHash);
hashStatPrt(greenHash);
hashStatPrt(zzHash);
is
pgmHash: size=97007, load=5393, load/size= 0.0556, collisions=145
distr[ 0]=91758
distr[ 1]=5106
distr[ 2]=142
distr[ 3]=1
distr[ 4]=0
distr[ 5]=0
distr[ 6]=0
distr[ 7]=0
distr[ 8]=0
distr[ 9]=0
greenHash: size=73, load=13, load/size= 0.1781, collisions=0
distr[ 0]=60
distr[ 1]=13
distr[ 2]=0
distr[ 3]=0
distr[ 4]=0
distr[ 5]=0
distr[ 6]=0
distr[ 7]=0
distr[ 8]=0
distr[ 9]=0
zzHash: size=97007, load=43767, load/size= 0.4512, collisions=10211
distr[ 0]=62029
distr[ 1]=27466
distr[ 2]=6367
distr[ 3]=1026
distr[ 4]=106
distr[ 5]=13
distr[ 6]=0
distr[ 7]=0
distr[ 8]=0
distr[ 9]=0
Just for debugging, printing all entries in the hash-table is simple:
void hashPrt (hash_t *h) {
int i=0;
struct hashElem *e;
printf("\t%s\n",h->name);
for (e=h->first; e; e=e->nextAll)
printf("\t\t%d key=|%s|, val=|%s|\n",++i,e->key,e->val);
}
Added 17-Mar-2018: Also see Iterating over hash sets quickly in Java.