06-02-2005, 12:54 PM   #3
technobard
Quote:
Originally Posted by j.gohel
So how to approach regarding this using only Core Java API.
It sounds like you've already outlined a general approach. The only big piece missing is parsing. You seem to know what your pages will look like, which helps a lot. URLConnection doesn't hand you the page as a String directly; it gives you an InputStream (via getInputStream()) that you read into a String or StringBuffer yourself. After retrieving your page, I'd do something like:
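For the retrieval part, here is a minimal sketch using only the core API. The class name PageFetcher and the method names readAll and fetch are made up for illustration, and UTF-8 is assumed as the page encoding, which may not match your pages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class PageFetcher {

    // Reads the whole stream into a String.
    // Assumption: the content is UTF-8; a real page may declare another charset.
    static String readAll(InputStream in) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                page.append(buffer, 0, count);
            }
        }
        return page.toString();
    }

    // Opens the address with URLConnection and returns the response body as text.
    static String fetch(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        return readAll(connection.getInputStream());
    }
}
```

Calling PageFetcher.fetch with the address of your page should give you the HTML as one String, ready for the searching described in the steps.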

1) Search for the beginning of a section
2) Search for the end of that section
3) Create a substring based on the beginning and ending of the section
4) Search the substring for links and store them in a List
5) Loop through the List of links, retrieving the HTML for each one and loading it into a database
6) Repeat starting at Step 1 until no sections remain
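Steps 1 through 4 and the repeat in Step 6 can be sketched with nothing but String.indexOf and substring. This is only a rough illustration: the class and method names are made up, the start and end markers are whatever your pages actually use to delimit a section, and the link search assumes every link is written as href="..." with double quotes. Fetching each link and loading it into a database (Step 5) is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class SectionLinkExtractor {

    // Steps 1-3 and 6: cut out every section that sits between the two markers.
    static List<String> extractSections(String page, String startMarker, String endMarker) {
        List<String> sections = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = page.indexOf(startMarker, from);      // Step 1
            if (start < 0) break;                             // no more sections
            int bodyStart = start + startMarker.length();
            int end = page.indexOf(endMarker, bodyStart);     // Step 2
            if (end < 0) break;                               // unterminated section
            sections.add(page.substring(bodyStart, end));     // Step 3
            from = end + endMarker.length();                  // Step 6: continue after it
        }
        return sections;
    }

    // Step 4: collect the href values found inside one section.
    // Assumption: links look like href="..." with double quotes.
    static List<String> extractLinks(String section) {
        List<String> links = new ArrayList<>();
        String attribute = "href=\"";
        int from = 0;
        while (true) {
            int start = section.indexOf(attribute, from);
            if (start < 0) break;
            int valueStart = start + attribute.length();
            int end = section.indexOf('"', valueStart);
            if (end < 0) break;
            links.add(section.substring(valueStart, end));
            from = end + 1;
        }
        return links;
    }
}
```

Because each section is its own substring, extractLinks never has to check whether a match has wandered past the end of the section.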

As Belisarius pointed out, there are screen scraping tools out there that probably make this process easier (once you figure out the API). If you're intent on doing it yourself, the steps I mentioned, along with what you've already outlined, should get you there eventually.

Note: This is obviously not the only way to handle this. For example, creating a substring for each section isn't strictly necessary. It just makes your search simpler, since you don't have to check that each match stays within the section boundary. Likewise, you could build a List of sections, similar to the List of links, and loop over that. You get the idea.

Good luck!