Code Newbie
Old 07-21-2006, 09:41 AM   #1 (permalink)
tpennetta
Registered User
Join Date: Jul 2006
Posts: 6
Plain Text to Tab Delimited

Hello,

I am trying to write a program to convert plain text to a tab-delimited version of that text.

I know a way to hack around this using substring and charAt-style functions, but if anyone has more efficient suggestions or solutions, that would be a great help. Also I am not limited to C / C++ if this solution would be easier using a different language.

Thanks,
Tom
Old 07-21-2006, 10:52 AM   #2 (permalink)
teknomage1
Jack of all trades
Join Date: Feb 2005
Location: Los Angeles
Posts: 598
Where do you want the tabs to separate the text? Between every word? Every line?
__________________
Stop intellectual property from infringing on me
Old 07-21-2006, 03:10 PM   #3 (permalink)
tpennetta
Thanks for the reply.

Sorry for the confusion, but I have just found out that the file does not have to be converted from plain text to tab-delimited format, but to comma-delimited format.

There are several added complications. The original format looks as though it is separated into columns, but the gaps are actually just runs of spaces. Also, instead of the fields being separated by columns, they are separated by rows. It seems to be quite a challenge.

If requested I will put up a test version of the original format.

Thanks,
Tom
Old 07-21-2006, 10:36 PM   #4 (permalink)
waveclaw
Recruit
Join Date: Jul 2006
Location: USA
Posts: 18
Quote:
Also I am not limited to C / C++ if this solution would be easier using a different language.
C and C++ are not very fun languages in which to write string-manipulating programs. You will spend at least half your time dealing with memory allocation, leaks and buffer management (overflows, underflows, fixed vs. dynamic allocation, magic size numbers).

Perl is available for both Microsoft and other platforms (Linux, Solaris, etc.) and is well suited to text translation. There are some who will encourage you to learn Python instead. However, to use these languages effectively you will need to know regular expression syntax (arguably a language all its own).

Do you have access to a Unix environment?

There are several tools including tr, awk, sed, cut and paste that make text conversion easy if you can get predictable and regular strings. awk and sed have their own lightweight languages suited to line-by-line text editing. For example:

Code:
#turn every run of spaces into a single tab (tr maps single characters, so it cannot match pairs)
tr -s ' ' '\t' < file.in > file.out
or
Code:
#use the stream editor (sed) to replace stretches of two or more whitespace characters with commas
sed -E 's/[[:space:]]{2,}/,/g' < file.in > file.out

This particular 'project' is a common one used for training people new to Unix: given a text file with some symbol throughout it, replace it with another. Introductory Unix texts and most automation, scripting or system administration books include examples of how to do this. For example, off the top of my head, a book on Perl might have:

Code:
perl -p -e 's/\s\s+/\t/g' my_file.txt > my_file.tsv
or
Code:
perl -p -e 's/\s\s+/,/g'  my_file.txt > my_file.csv
as an example of redirecting stdout to a file in a Unix shell (> my_file) and the -p (iterate over lines in a file) and -e 'text' (execute script fragment 'text' on the command line) options for perl.

Note that 's/\s\s+//g' is a Perl regular expression (the syntax that PCRE, Perl-Compatible Regular Expressions, imitates). s/// is the substitution command in both Perl and sed. It has been given the global option (g) to replace every match on a given line. The pattern describes any string composed of a whitespace character (\s) followed by one or more whitespace characters (\s+), in other words a run of two or more. It will replace that whitespace (spaces, tabs, newlines and similar non-printing separators, not letters, digits or punctuation) with the given character (a tab in the first example and a comma in the second).

FYI, comma-separated values can be complicated, especially if you already have commas in your input. In that case, you will have to wrap entries in quotes or escape the comma. As I would bet this is being fed into Microsoft Excel(tm), you won't be able to use escape codes, so quoting is the way to go.
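For anyone who wants to stay in C++ after all, the same idea (split wherever two or more whitespace characters occur in a row, then quote any piece that already contains a comma) could be sketched roughly as below. The function names quoteCsvField and spacesToCsv are made up for this example, and it assumes a C++11 compiler with std::regex; treat it as a starting point, not a tested converter.

```cpp
#include <regex>
#include <string>

// Quote a field only when it needs it: wrap it in double quotes and double
// any embedded quotes, which is the convention spreadsheet programs expect.
std::string quoteCsvField(const std::string& field) {
    if (field.find_first_of(",\"\n") == std::string::npos) {
        return field;
    }
    std::string quoted = "\"";
    for (char c : field) {
        if (c == '"') quoted += '"';  // "" stands for one literal quote
        quoted += c;
    }
    quoted += '"';
    return quoted;
}

// Split a line wherever two or more whitespace characters occur in a row,
// then join the pieces with commas.
std::string spacesToCsv(const std::string& line) {
    static const std::regex gap("\\s\\s+");
    // The -1 selects the text between matches instead of the matches themselves.
    std::sregex_token_iterator it(line.begin(), line.end(), gap, -1);
    std::sregex_token_iterator end;
    std::string out;
    bool first = true;
    for (; it != end; ++it) {
        if (!first) out += ',';
        out += quoteCsvField(it->str());
        first = false;
    }
    return out;
}
```

Single spaces inside a field (as in "12 MAIN ST") are left alone because the pattern needs at least two whitespace characters to match. A line that starts with a run of spaces will produce an empty first field, which may or may not be what you want.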
Old 07-22-2006, 08:59 AM   #5 (permalink)
tpennetta
Thank you for the help thus far. Perl is probably my best choice for this solution, but now for the added confusion. Like I said, the format is very unorthodox.

The original format is:

http://web.njit.edu/~tmp8/testorig.txt

The desired format is:

Quote:
LAST,FIRST,132135,ADDRESS 1,ADDRESS 2 OR APPT,CITY,NJ,07444
LAST,FIRST,006273,ADDRESS 1,,CITY,NJ,08022
As you can see, the file is not organized in an easy way, and if nothing is provided for a field, that field should just be left blank. Any help is greatly appreciated again. Thank you.

Tom
Old 07-22-2006, 09:22 AM   #6 (permalink)
tpennetta
Sorry to double post, but I have an updated link:

http://www.njelectronics.com/testorig.txt
Old 07-22-2006, 12:34 PM   #7 (permalink)
teknomage1
Wow, that's a fun format to deal with, and by fun I mean difficult.
I'm going to go ahead and guess further that there are fixed field lengths for each category, so instead of needing to apply the full power of a regular expression library you just need to read in a fixed number of characters at a time for each field. I would imagine the people who sent you the text format would have the field lengths, but worst case you can just measure them yourself from the text file.
It looks like you need to be able to process info for a couple of records at a time since they share line space. Some pseudocode for your data collection might look like
Code:
#include <cstddef>
#include <string>

const std::size_t NAMEFIELD_LENGTH = 25; //placeholder width, measure the real one from the file

std::string readNameField (std::size_t offset, const std::string& inputLine) {
    //substr throws std::out_of_range if offset is past the end of the line
    return inputLine.substr(offset, NAMEFIELD_LENGTH);
}
//etc... for all your data fields
int main() {
  //read in a line
  //allocate 4 new structs to hold the address data
  //update counter that tells us what type of line we're dealing with (names, addresses, cities, etc.)
  //switch cases depending on the counter above
    //run the appropriate "parsing" routines and store data in the proper structs
  //when the counter gets to the last line, print out the contents of the structs in the new format, and clear the values of the structs for new input
}
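If the fields really are fixed width, the "read a fixed number of characters per field" step could be written once for all field types instead of one function per field. The sketch below is only an illustration: sliceFixedWidth and trimRight are hypothetical helper names, and the widths have to come from whoever produced the file or from measuring it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Remove the trailing spaces that pad a fixed-width field.
std::string trimRight(const std::string& s) {
    const std::size_t last = s.find_last_not_of(' ');
    return last == std::string::npos ? std::string() : s.substr(0, last + 1);
}

// Cut a line into consecutive fields of the given widths. Lines that are
// shorter than expected simply yield empty strings for the missing fields.
std::vector<std::string> sliceFixedWidth(const std::string& line,
                                         const std::vector<std::size_t>& widths) {
    std::vector<std::string> fields;
    std::size_t offset = 0;
    for (std::size_t width : widths) {
        if (offset < line.size()) {
            fields.push_back(trimRight(line.substr(offset, width)));
        } else {
            fields.push_back(std::string());
        }
        offset += width;
    }
    return fields;
}
```

Calling sliceFixedWidth(line, widths) on each input line gives back one string per column, with the padding on the right already stripped. Leading spaces are kept as they are, on the assumption that the data is left-aligned in its column.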
__________________
Stop intellectual property from infringing on me
Old 07-22-2006, 01:15 PM   #8 (permalink)
tpennetta
Yeah, looks like it's back to C++. Thanks for your help; I was wondering what others thought would be a good approach.

It sounds like a relatively small program with just a few hurdles. Thanks for all your help and input, hopefully I can help you in the future.

Tom
Old 07-24-2006, 02:27 AM   #9 (permalink)
Valmont
[code][/code] enforcer
Join Date: Mar 2003
Location: Netherlands
Posts: 1,544
I looked at it a few minutes ago. I don't have a problem coding what you need, but you need to define the problem better.

1) Define your current standard.
2) Define the new standard (already done).
__________________

Last edited by Valmont; 07-24-2006 at 04:10 AM.
Old 07-24-2006, 03:41 AM   #10 (permalink)
Valmont
Oh jolly, now I see the problem.

Your file problem isn't solvable because the format isn't well defined.
Maybe it is but you didn't provide the definition.

The problem is that a computer can't know the difference between a person's name and a company name. Eventually, for a database to be a database, something needs to be delimited by something. This isn't the case here.

Did you export the file from a database? From which? How did you export it?
I can't imagine the export went ok.
__________________

Last edited by Valmont; 07-24-2006 at 04:11 AM.
Old 07-24-2006, 06:25 AM   #11 (permalink)
tpennetta
Thanks for your post, but you seem to be reading too much into this, or not enough.

The problem is definitely solvable.
The problem is that the original text has no specific delimited format; it seems to just be fixed with columns, with all the data fields separated by rows instead of columns.

As for the computer not knowing the difference between names and company names, it will know the difference when I'm through with it. All that takes is an added check to see if a comma is present in the original field and parse it accordingly.
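A rough sketch of that comma check is below. It assumes that a person is written as "LAST, FIRST" in the name field and that a company name contains no comma; both are guesses about the file, and parseNameField, ParsedName and trim are names invented for the example.

```cpp
#include <cstddef>
#include <string>

struct ParsedName {
    bool isPerson;       // true when the field contained a comma
    std::string last;    // surname, or the whole company name
    std::string first;   // empty for companies
};

// Strip spaces from both ends of a string.
std::string trim(const std::string& s) {
    const std::size_t b = s.find_first_not_of(' ');
    if (b == std::string::npos) return std::string();
    const std::size_t e = s.find_last_not_of(' ');
    return s.substr(b, e - b + 1);
}

// Split "LAST, FIRST" at the first comma; anything without a comma is
// treated as a company name and stored whole in `last`.
ParsedName parseNameField(const std::string& field) {
    ParsedName result;
    const std::size_t comma = field.find(',');
    if (comma == std::string::npos) {
        result.isPerson = false;
        result.last = trim(field);
    } else {
        result.isPerson = true;
        result.last = trim(field.substr(0, comma));
        result.first = trim(field.substr(comma + 1));
    }
    return result;
}
```

A company name that itself contains a comma (something like "ACME, INC.") would be misread as a person by this check, so the real data would need to be looked over for cases like that.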

The file has not been exported, and looks to have been typed out by hand this way.

Thanks.
Old 07-24-2006, 11:31 AM   #12 (permalink)
Valmont
The problem is that all fields of the records seem to be optional.
You'll need to provide a better definition than "fixed with columns".

The columns-part I could find out myself.

Example:
For the first record, both ADDRESS 1 and ADDRESS 2 are provided. But for the second record ADDRESS 2 is missing. Yet everything is space delimited. So how would the computer know that the next <string> isn't an "ADDRESS 2"?
For the system to know, there must be a clear definition. Obviously that has to do with delimiters. Note that the system doesn't have an intuition like we do.
__________________
Old 07-24-2006, 12:08 PM   #13 (permalink)
teknomage1
I think that was supposed to be "fixed wiDth", Valmont.
__________________
Stop intellectual property from infringing on me
Old 07-24-2006, 12:14 PM   #14 (permalink)
Valmont
I've checked for that as well. There's a problem with consistency. A clear definition is still missing.
I could easily do the coding, but I need definitions.
__________________
Old 07-24-2006, 12:34 PM   #15 (permalink)
redhead
Newbie
Join Date: Jun 2002
Location: Denmark
Posts: 1,720
Hmm... from what I can see, the first column contains 25 chars, the second 11 chars, the third 25 chars, the fourth 11 chars, the fifth 25 chars, the sixth 11 chars, the seventh 25 chars and the eighth 11 chars.
And there are 4 rows describing each separate "four infos".

So from my point of view, the definition of it is something like:
Code:
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct info {
    std::string first;   /* 25-character column */
    std::string second;  /* 11-character column */
};

int main() {
    const std::size_t W1 = 25, W2 = 11, PAIR = W1 + W2;
    std::ifstream ifp("in_file.txt");
    std::ofstream ofp("out_file.txt");
    std::vector<info> list;
    std::string str;
    while (true) {
        /* make sure nothing is left in our list, so we won't repeat the output */
        list.clear();
        /* read four lines at a time */
        for (int i = 0; i < 4 && std::getline(ifp, str); ++i) {
            /* pad short lines so substr never starts past the end */
            if (str.size() < 4 * PAIR) str.resize(4 * PAIR, ' ');
            /* split each line up into four columns consisting of two combined infos */
            for (std::size_t j = 0; j < 4; ++j) {
                info tmp;
                tmp.first = str.substr(j * PAIR, W1);
                tmp.second = str.substr(j * PAIR + W1, W2);
                list.push_back(tmp);
            }
        }
        /* stop at end of file or on an incomplete group of four lines */
        if (list.size() < 16) break;
        /* rearrange the locations of the read info, to reflect the wanted format */
        for (std::size_t i = 0; i < 4; ++i) {
            ofp << list[i].first << ", " << list[i].second << ", "
                << list[i + 4].first << ", " << list[i + 4].second << ", "
                << list[i + 8].first << ", " << list[i + 8].second << ", "
                << list[i + 12].first << ", " << list[i + 12].second << std::endl;
        }
    }
    return 0;
}
Now I haven't tested any of this, but it looks to me like you have to assume there will always be these restrictions on the input strings, so I left out all the error checking.
If you wanted, you could check whether the info stored in first/second is all spaces and then skip printing it, and you'll end up with your comma-separated info, where the empty fields won't be represented.
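The "all spaces" check could look something like the sketch below. isBlank and joinFields are hypothetical helpers, not part of the code above; the keepEmpty switch is there because the desired output posted earlier in the thread keeps a missing field as an empty slot (ADDRESS 1,,CITY) and does not drop it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when the field is empty or contains nothing but spaces.
bool isBlank(const std::string& field) {
    return field.find_first_not_of(' ') == std::string::npos;
}

// Join fields with commas. With keepEmpty set, a blank field becomes an empty
// slot (",,"); otherwise it is skipped altogether.
std::string joinFields(const std::vector<std::string>& fields, bool keepEmpty) {
    std::string out;
    bool first = true;
    for (const std::string& field : fields) {
        const bool blank = isBlank(field);
        if (blank && !keepEmpty) continue;
        if (!first) out += ',';
        if (!blank) {
            // strip the padding on both sides before writing the value
            const std::size_t b = field.find_first_not_of(' ');
            const std::size_t e = field.find_last_not_of(' ');
            out += field.substr(b, e - b + 1);
        }
        first = false;
    }
    return out;
}
```

With keepEmpty set to true every record ends up with the same number of commas, which is usually what a spreadsheet import wants; with false the columns of different records may no longer line up.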
__________________
Don't worry Ma'am, We're university students, We know what We're doing.
-----
If you pull the pin, Mr.Grenade would no longer be your friend.
-----
01000111 01101111 00100000 01000011 00100000 00100001