|
 |
|
 |
10-11-2006, 06:31 AM
|
#1 (permalink)
|
|
Recruit
Join Date: Sep 2006
Posts: 10
|
Help with a Hashing program
Hello,
I'm writing a database program that utilizes hashing for a Data Structures class. I have it working mostly, but I'm having a problem with a stack overflow error if I try and use a hash size of more than 11,000 or so. I need to be able to use about 60,000 for the size in order to be able tor ead the huge file he gave us of about 20,000 lines.
Also I'm having a problem with command line arguments. It's supposed to read in the hash size as the first argument and the file name to open as the second. I have some code in there my teacher gave me, but I'm confused as to how to actually use it. Can someone explain this to me? My code is below:
dbase.cpp file:
Code:
#include "dbase.h"
#include <vector>
#include <fstream>
#include <string>
#include <iostream>
int collisions;
/* Constructor, initializes class variables and sets all slots to start as slotFree */
Database::Database()
{
nItems = changed = nProbes = s = 0;
strcpy(lastFile, "");
for(int i=0; i<HASHSIZE; i++)
status[i] = slotFree;
}
// Destructor
Database::~Database()
{}
// Function to add entries to the database
bool Database::add(char *sLast, char *sFirst, char *sPhone, char *sId, char* sGender)
{
int slot;
s = 0;
if (nItems<MAXENTRIES) // Make sure database isn't full
{
slot = Hash(sLast); // Hash the key to get slot #
while(status[slot] == slotUsed) // If collision occurs, find open slot
// using quadratic probe
{
collisions++;
s++;
slot = (slot + s*s) % HASHSIZE; // <-- Quadratic probe
}
/* Empty slot was found, so add the entry there */
strcpy(last[slot], sLast);
strcpy(first[slot], sFirst);
strcpy(phone[slot], sPhone);
strcpy(id[slot], sId);
strcpy(gender[slot], sGender);
status[slot] = slotUsed;
nItems++;
changed++; // Show that file was changed.
return TRUE;
} else
return FALSE;
}
/* getFirst Function to find the first matching entry */
bool Database::getFirst(char *sLast, char *sFirst, char *sPhone, char *sId, char *sGender)
{
int slot;
s = 0; /* s is local in this function because this function doesn't
exit until it finds a free slot, and s doesnt need to be
remembered */
slot = Hash(sLast);
while(status[slot] == slotUsed && (strcmp(sLast, last[slot]) != 0))
{
s++;
nProbes++;
slot = (slot + s*s) % HASHSIZE;
}
nProbes++;
if (status[slot] == slotUsed && (strcmp(sLast, last[slot]) == 0))
{ /* fill in the remaining return values since key was found */
strcpy(sFirst, first[slot]);
strcpy(sPhone, phone[slot]);
strcpy(sId, id[slot]);
strcpy(sGender, gender[slot]);
current = slot; //set current slot so getNext will know where to start
s++;
return TRUE;
} else
return FALSE;
}
/* getNext Function to find additional matches */
bool Database::getNext(char *sLast, char *sFirst, char *sPhone, char *sId, char *sGender)
{
int slot;
slot = (current + s*s) % HASHSIZE; /* slot and s are class variables */
while(status[slot] != slotFree && (strcmp(sLast, last[slot]) != 0))
{
s++;
nProbes++;
slot = (slot + s*s) % HASHSIZE;
}
nProbes++;
if (status[slot] == slotUsed && (strcmp(sLast, last[slot]) == 0))
{
strcpy(sFirst, first[slot]);
strcpy(sPhone, phone[slot]);
strcpy(sId, id[slot]);
strcpy(sGender, gender[slot]);
current = slot; // remember these two so getNext can resume where
s++; // it left of on the next consecutive search
return TRUE;
} else
return FALSE;
}
/* load Function to read file entries and call add function to put them in database */
bool Database::load(char *fname)
{
ifstream infile; // Create input stream
char tmpFirst[16] , // Create temp variables to store current record
tmpLast[16] ,
tmpPhone[11],
tmpId[11],
tmpGender[2];
strcpy(lastFile, fname);
infile.open(fname);
if (infile.fail())
return FALSE; // Couldn't open file
for (int i=0; i < HASHSIZE; i++) // Reinitialize slots to be Free
status[i] = slotFree;
while(!infile.eof()) // While it is not the end of the file
{
infile>>tmpLast; // Get each part of entry from the line
infile>>tmpFirst;
infile>>tmpPhone;
infile>>tmpId;
infile>>tmpGender;
if (!add(tmpLast, tmpFirst, tmpPhone, tmpId, tmpGender))
return FALSE;
}
cout<<"# of Collisions: " << collisions<<endl;
infile.close();
changed = 0;
return TRUE;
}
/* Hash Function, requires Last name to hash */
int Database::Hash(char *sLast)
{
int sum = 0, i;
/* hashes key using the ASCII values of each letter in the name.
to create a good hash, each subsequent letter is weighted heavier
and heavier than the prior ones */
for(i = 0; sLast[i]; i++)
sum = sum * i + sLast[i];
return abs(sum) % HASHSIZE; /* used abs() in case of sum getting too large
and wrapping around as a negative */
}
dbase.h file:
Code:
#include <fstream.h>
#include <iostream.h>
#include <string.h>
#include <math.h>
const int HASHSIZE = 11000; // hash table must be larger than MAXENTRIES
// to reduce the # of collisions that occur
const int MAXENTRIES = 3000; // # of entries database can hold
const int TRUE = 1;
const int FALSE = 0;
enum States { slotFree, slotUsed, slotDeleted};
/* Creates the class for the database */
class Database {
private:
char first [HASHSIZE][32],
last [HASHSIZE][32],
phone [HASHSIZE][16],
id [HASHSIZE][6],
gender [HASHSIZE][2],
lastFile[80]; // stores filename in Load
int nItems,
s, // used in quadratic probe(collisions)
nProbes, // tracks # of times we look at the table
// ( to demonstrate efficiency )
changed, // tracks if array has been modified
current;
States status[HASHSIZE]; // runs parallel to hash table and does bookkeeping
// (keeps track of slotFree, slotUsed, etc.)
int Hash(char *);
public:
/* Function prototypes */
Database();
~Database();
int getProbes(){ return nProbes; }
bool add(char *, char *, char *, char *, char *);
bool getFirst(char *, char *, char *, char *, char *);
bool getNext (char *, char *, char *, char *, char *);
bool load(char *);
};
main.cpp file:
Code:
#include "dbase.h"
char menu(void);
void main(int argc, char *argv[]){
bool success;
int found;
Database DataBase;
char first[64], last[32], phone[16], id[32], gender[1], choice, file[80];
/* Attempts to load file if specified on command line */
if (argc==2)
if (success = DataBase.load(argv[1]))
cout << argv[1] << " successfully loaded";
else
cout << argv[1] << " was not loaded";
/* Temporary Menu Function made to test program function */
choice = menu();
while(choice!='q' && choice!='Q')
{
switch(choice)
{
/* CASE: LOAD */
case 'L':
case 'l':
cout << "Enter the filename: " << flush;
cin >> file;
if (success = DataBase.load(file))
cout << file << " successfully loaded\n\n";
else
cout << file << " was not loaded\n\n";
break;
/* CASE: ADD */
case 'A':
case 'a':
cout << "First Name: ";
cin >> first;
cout << "Last Name: ";
cin >> last;
cout << "Phone Number: ";
cin >> phone;
cout << "ID Number: ";
cin >> id;
cout << "Gender: ";
cin >> gender;
if (success = DataBase.add(last, first, phone, id, gender))
cout << "Entry Added\n\n";
else
cout << "Unable to add entry\n\n";
break;
/* CASE: SEARCH */
case 'S':
case 's':
cout << "Enter key to search for: ";
cin >> last;
if(found = DataBase.getFirst(last, first, phone, id, gender))
{
cout << "\n\n" << first << "\t" << last << "\t" << phone << "\t" << id << "\t" << gender << endl;
while(found)
if (found = DataBase.getNext(last, first, phone, id, gender))
cout << first << "\t" << last << "\t" << phone << "\t" << id << "\t" << gender << endl;
}
else
cout << "\nKey not found!\n";
break;
/* CASE: SELECTION DOES NOT EXIST */
default:
cout << "That is not a valid option\nPlease choose again";
}
choice = menu();
}
cout << DataBase.getProbes() << endl;
}
/* Temporary Menu for testing purposes */
char menu(void){
char choice;
cout << "\n[L]oad file\n"
<< "[A]dd Entry\n" << "[S]earch\n\n"
<< "[Q]uit\n\n";
cout << "Enter your choice: " << flush;
cin >> choice;
return choice;
}
Last edited by redhead; 10-11-2006 at 10:15 AM.
|
|
|
10-11-2006, 02:05 PM
|
#2 (permalink)
|
|
Code Monkey
Join Date: Mar 2005
Posts: 55
|
Variables created on the stack take up space on the stack, so when you need large arrays use malloc or new.
|
|
|
10-11-2006, 02:07 PM
|
#3 (permalink)
|
|
Code Monkey
Join Date: Mar 2005
Posts: 55
|
argv[0] always gives you the name of the executed program, and your command line arguments start at argv[1].
|
|
|
10-11-2006, 02:14 PM
|
#4 (permalink)
|
|
Recruit
Join Date: Jul 2006
Location: USA
Posts: 17
|
In general...and in particular.
You mentioned a stack overflow and you are allocating large sparse arrays in C++. That could be due to your C++ compiler writing an application that is trying to store things on your operating system's stack that don't fit.
Please cut-n-paste the exact error you are getting.
Also, what compiler are you using? - Visual Studio?
- .Net Studio?
- GCC?
- a commercial compiler?
on what target platform - Windows v?
- Linux distro ?
- BSD / netBSD, OpenBSD?
- SunOS x.y?
- AIX?
It would be nice to have the teacher's code excplicitly labeled (and your code labeled.)
When trying to understand the (sometimes defective) code given by a professor, be aware of the two levels of understanding needed. - Understand the language-level operations and flow of data: additions, substractions, pointer defrecerencing. Also, if the professor gave you code he just whipped up for the assignment, it may suffer from serious implementation defects. (There are often easy to make syntax errors and systematic logic errors with complex languages that a professor may gloss over.)
- Understand the why of a language-level construct: we increment the variable i no because you are iterating through an array, but because you are looking for something in that array. This is the most powerfull understanding since it lets you refactor (idempotenetly substitute architecture throughout) for functionaility.
The reason I bring up defective professor code, is that you are having stack overflows and you are using a sparse array to store a hash.
The code:
Code:
* Hash Function, requires Last name to hash */
int Database::Hash(char *sLast)
{
int sum = 0, i;
/* hashes key using the ASCII values of each letter in the name.
to create a good hash, each subsequent letter is weighted heavier
and heavier than the prior ones */
for(i = 0; sLast[i]; i++)
sum = sum * i + sLast[i];
return abs(sum) % HASHSIZE; /* used abs() in case of sum getting too large and wrapping around as a negative */
}
This hash garuntees a low level of collision (where you are trying to put two values at the same location) but uses vast amounts of memory for even tiny lists (this is why hashes are usually introduced after linked-lists in a data structures class) on your platform. If your program is compiling, run it in a debugger with memory profiling on. You should not be surprised at the amount of wasted space.
With classroom assignments, you are given trivial problems in simple contexts to resolve bugs and implement solutions. The closer to graduation you get the more team-oriented programming should get. I am assumeing teamwork is banned. I recommend doing what professionals with non-trivial problems in complex contexts do. Go ask you professor how he wants you to use his code.
I don't mean to be demeaning, but Code Newbie really needs a section entitled 'homework assignments' with nothing but a helpful pointer to University of Penn's good cheating policy page or Wisconsin - Madison's plagarism site. There is only so far anybody can go to help someone on a homework assignment. It sucks, but it beats getting an F for a working program.
|
|
|
10-11-2006, 08:34 PM
|
#5 (permalink)
|
|
Recruit
Join Date: Sep 2006
Posts: 10
|
Quote:
|
Originally Posted by QUantumAnenome
Variables created on the stack take up space on the stack, so when you need large arrays use malloc or new.
|
Can you elaborate on that? I've only had one programming class before this one and am not sure where you mean for me to implement that.
Thank you
|
|
|
10-11-2006, 08:40 PM
|
#6 (permalink)
|
|
Recruit
Join Date: Sep 2006
Posts: 10
|
Quote:
|
Originally Posted by waveclaw
You mentioned a stack overflow and you are allocating large sparse arrays in C++. That could be due to your C++ compiler writing an application that is trying to store things on your operating system's stack that don't fit.
Please cut-n-paste the exact error you are getting.
Also, what compiler are you using? - Visual Studio?
- .Net Studio?
- GCC?
- a commercial compiler?
on what target platform - Windows v?
- Linux distro ?
- BSD / netBSD, OpenBSD?
- SunOS x.y?
- AIX?
It would be nice to have the teacher's code excplicitly labeled (and your code labeled.)
When trying to understand the (sometimes defective) code given by a professor, be aware of the two levels of understanding needed. - Understand the language-level operations and flow of data: additions, substractions, pointer defrecerencing. Also, if the professor gave you code he just whipped up for the assignment, it may suffer from serious implementation defects. (There are often easy to make syntax errors and systematic logic errors with complex languages that a professor may gloss over.)
- Understand the why of a language-level construct: we increment the variable i no because you are iterating through an array, but because you are looking for something in that array. This is the most powerfull understanding since it lets you refactor (idempotenetly substitute architecture throughout) for functionaility.
The reason I bring up defective professor code, is that you are having stack overflows and you are using a sparse array to store a hash.
The code:
Code:
* Hash Function, requires Last name to hash */
int Database::Hash(char *sLast)
{
int sum = 0, i;
/* hashes key using the ASCII values of each letter in the name.
to create a good hash, each subsequent letter is weighted heavier
and heavier than the prior ones */
for(i = 0; sLast[i]; i++)
sum = sum * i + sLast[i];
return abs(sum) % HASHSIZE; /* used abs() in case of sum getting too large and wrapping around as a negative */
}
This hash garuntees a low level of collision (where you are trying to put two values at the same location) but uses vast amounts of memory for even tiny lists (this is why hashes are usually introduced after linked-lists in a data structures class) on your platform. If your program is compiling, run it in a debugger with memory profiling on. You should not be surprised at the amount of wasted space.
With classroom assignments, you are given trivial problems in simple contexts to resolve bugs and implement solutions. The closer to graduation you get the more team-oriented programming should get. I am assumeing teamwork is banned. I recommend doing what professionals with non-trivial problems in complex contexts do. Go ask you professor how he wants you to use his code.
I don't mean to be demeaning, but Code Newbie really needs a section entitled 'homework assignments' with nothing but a helpful pointer to University of Penn's good cheating policy page or Wisconsin - Madison's plagarism site. There is only so far anybody can go to help someone on a homework assignment. It sucks, but it beats getting an F for a working program.
|
The only code from my professor is the lines for the command line arguments, but i'm not sure how to make it do what I need. I need to have it read 2 arguments, the first being the hash table size (which needs to be about 60,000 for this particular application) and the 2nd argument is the file name to read from. I believe what he gave me is to just read 1 argument which is the file name and then load it.
And teamwork isn't banned by this guy, as long as 2 of us don't end up with identical code. He tells us to have our friends check it and such if we have problems, but nobody I know has gotten this to work either, or I wouldn't be asking here because I know how bad people get heckled about asking for help on homework. My teacher isn't much help to me as I've asked him about this several times already with no real assistance as to how to fix it.
My program works for the size of stuff that it currently is coded to do. But I need it to expand the hashsize to 60,000 and maxentries to about 20,000.
I'm using Visual Studio 6.0 and it's for Windows platform.
The error i get is not a compile error, but when it runs it says stack overflow and crashes if i set the stack size to anything above 11,000 or so.
|
|
|
10-15-2006, 10:59 AM
|
#7 (permalink)
|
|
Recruit
Join Date: Jul 2006
Location: USA
Posts: 17
|
look up the examples
Get really used to passing around pointers to classes and function pointers and dynamic memory allocation. Most of the non-trivial topics in a data structures class will need them.
Quote:
Variables created on the stack take up space on the stack, so when you need large arrays use malloc or new.
Can you elaborate on that? I've only had one programming class before this one and am not sure where you mean for me to implement that.
|
The Linux kernel, GNOME desktop and most of Microsoft Windows is written in C and C++. So, for C/C++ programming, you can often find a lot of information from the Linux manual pages (manpage's) which can be found in a full Linux installation or online at various places ( LMPO is just one.) The below summary comes from the manpage for malloc.
Code:
NAME
calloc, malloc, free, realloc - Allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *calloc(size_t nmemb, size_t size);
void *malloc(size_t size);
void free(void *ptr);
void *realloc(void *ptr, size_t size);
This page goes on for quite a while about how to use malloc(), free(), etc.
Quote:
|
My program works for the size of stuff that it currently is coded to do. But I need it to expand the hashsize to 60,000 and maxentries to about 20,000.
|
Programmers are taught to think of functions like they came out of a math textbook. In reality your operating system uses them only as labels in a huge stack to keep track of where in your program and all other programs the OS is. The OS keeps function parameters and a slot for your return data on this stack too.
This stack can only hold so many things. Each thing can only be of a limited size. These limits are usually hardcoded by the OS developer or just the limit of the real (not swap) memory on your system. You are making things that are too big to fit on your OS'es stack with the RAM you have. You are probably also making too many things for each stack entry (i.e. per function.)
The solution to this is dynamic memory access as QUantumAnenome and others have mentioned. I seriously doubt that your professor was smart enough to realize his data structures assignment depended on material he did not cover. Dynamic memory allocation is an important part of 'real-world' data structures and as you have probably discovered, is a large topic on it's own.
Quote:
|
I'm using Visual Studio 6.0 and it's for Windows platform.
|
Since you are using VS, you should search the web or MSDN (student's and users of MS VS should get access to this for free) for information about new() and malloc(). Beware: Microsoft is well known for having a serverly hackish memory management system to accomodate the 1,000s of poorly written and in high-demand Windows-only applications (games, tax software, etc.) Since they are closed-source, Microsoft can't fix those apps, instead they work around it.
I hope you have time to do more than cut-n-paste from the examples. There are a lot of issues with using malloc(), new() and delete() which are not obvious: no initialization, corruption of memory, buffer-overflows, double free() corruption, memory-boundary conditions, etc.
Code:
...
type_i_defined *pointer_to_that_type;
pointer_to_that_type = (type_i_defined*) malloc(sizeof(type_i_defined));
// or for an array
int number_in_array = 10; // whatever constant you want
type_i_defined *pointer_to_array_of_that_type;
pointer_to_array_of_that_type =
(type_i_defined*) calloc(sizeof(type_i_defined), number_in_array);
// calloc sets all the memory to zero (malloc won't)
...
If you are wondering, this dynamic memory comes from a data structure called a heap which is managed by your OS. It also cannot exceed your available memory but can grow as large as your swapfile + RAM allows. FYI, a heap is usually how one impements a hash table (unless you only want to hack together a simple linked list.)
|
|
|
| Thread Tools |
|
|
| Display Modes |
Linear Mode
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -8. The time now is 03:52 PM.
|
Copyright © 2000-2008, Milano Interactive
Web Hosting provided by Portal 360 Web Hosting
|
 |
|