Code Newbie
Old 06-01-2005, 02:36 AM   #1 (permalink)
j.gohel
Code Monkey
 
Join Date: Apr 2005
Posts: 68
Need guidance regarding a program

Hello Sir,

I am writing a program in Java that fetches the contents of a page from an existing website and stores them in a database table using JDBC.

I am using the URL class from the java.net package to call a particular URL, and I open a connection to it with the URLConnection class.

The web page has several main sections (e.g. Computers, Notebooks, Tablets, etc.), each listing information related to that section:

Computers   Notebooks   Tablets
Detail11    Detail21    Detail31
Detail12    Detail22    Detail32
Detail13    Detail23    Detail33
Detail14    Detail24    Detail34

All of the details (Detail11, Detail12, Detail13, Detail14, etc.) are hyperlinks. Clicking one displays the detailed information for that item.

What I want to do is fetch the contents behind each hyperlink in a particular section, parse those contents, and store them in a database table.

How should I approach this using only the Core Java API?

Waiting for your reply.

Thanking you,
Jignesh
Old 06-01-2005, 03:53 PM   #2 (permalink)
Belisarius
Java fanboy
 
 
Join Date: Aug 2003
Posts: 1,139
What you're wanting to do is called screen scraping. The problem isn't so much "How do I do this in Java?" as "How is this done in general?" I don't know how it's implemented elsewhere, but searching Google for "screen scraping" would be your best first step.
__________________
GitS
Old 06-02-2005, 11:54 AM   #3 (permalink)
technobard
Centurion Nova Prime
 
 
Join Date: May 2002
Location: Oak Park, IL (USA)
Posts: 285
Quote:
Originally Posted by j.gohel
So how to approach regarding this using only Core Java API.
It sounds like you've already outlined a general approach. The only big piece missing is parsing. You seem to know what your pages will look like, which helps a lot. URLConnection gives you an InputStream for the page, which you can read into a String or StringBuffer. After retrieving your page, I'd do something like:

1) Search for the beginning of a section
2) Search for the end of that section
3) Create a substring based on the beginning and ending of the section
4) Search the substring for links and store them in a List
5) Loop through the List of links, retrieve the HTML behind each one, and load it into a database
6) Repeat starting at Step 1 until all of the sections are done
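
Steps 1 to 4 could look roughly like the sketch below. It is only an illustration: I'm assuming each section starts at a known marker string such as an `<h2>` heading (your real pages may use something else), and the class and method names are made up for this example. The regular expression for `href` is a simplification that will miss unquoted or unusual links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionLinkFinder {
    // Simplified pattern: captures the quoted href value of an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Steps 1-3: return the text from startMarker up to (not including)
    // endMarker, or null if either marker is missing.
    static String extractSection(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        int end = html.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    // Step 4: collect every href value found in the section into a List.
    static List<String> extractLinks(String section) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(section);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page content, standing in for the fetched HTML.
        String html = "<h2>Computers</h2>"
                + "<a href=\"c1.html\">Detail11</a><a href=\"c2.html\">Detail12</a>"
                + "<h2>Notebooks</h2><a href=\"n1.html\">Detail21</a>";
        String section = extractSection(html, "<h2>Computers</h2>", "<h2>Notebooks</h2>");
        System.out.println(extractLinks(section)); // prints [c1.html, c2.html]
    }
}
```

Relative links like `c1.html` would still need to be resolved against the page's address before you fetch them in step 5.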

As Belisarius pointed out, there are screen scraping tools out there that probably make this process easier (once you figure out the API). If you're intent on doing it yourself, the steps I mentioned along with what you've already outlined should get you there eventually.
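
For the retrieval part, here is a minimal sketch of reading a page into a String through URLConnection. The class name `PageFetcher` is made up, and I'm assuming the page is UTF-8 encoded; a real page may declare a different charset. It uses try-with-resources, so it needs a reasonably recent Java version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Opens a connection to the address and returns the whole response body.
    // Assumption: the content is text encoded as UTF-8.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        URLConnection conn = url.openConnection();
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass the page address on the command line.
        if (args.length == 1) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

The same method would be called once for the main page and once for each link collected in step 4.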

Note: This is obviously not the only way to handle this. For example, creating a substring for each section isn't strictly necessary. It just makes your search simpler in that you don't have to make sure that it stays within a section boundary. Likewise, you could build a List of sections similar to the List of links and loop. You get the idea.
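
To illustrate that variation, here is one possible sketch that skips the substrings and instead limits the link search to each section's boundaries with `Matcher.region`. It assumes you know the section marker strings up front and that they appear in the page in the order given; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionIndex {
    // Simplified pattern: captures the quoted href value of an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Maps each section marker to the links found between it and the next
    // marker (or the end of the page for the last section).
    static Map<String, List<String>> linksBySection(String html, List<String> markers) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = HREF.matcher(html);
        for (int i = 0; i < markers.size(); i++) {
            int start = html.indexOf(markers.get(i));
            if (start < 0) {
                continue; // marker not on this page
            }
            int end = html.length();
            if (i + 1 < markers.size()) {
                int next = html.indexOf(markers.get(i + 1), start);
                if (next >= 0) {
                    end = next;
                }
            }
            m.region(start, end); // keep the search inside this section
            List<String> links = new ArrayList<>();
            while (m.find()) {
                links.add(m.group(1));
            }
            result.put(markers.get(i), links);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical page content and markers.
        String html = "<h2>Computers</h2><a href=\"c1.html\">Detail11</a>"
                + "<h2>Notebooks</h2><a href=\"n1.html\">Detail21</a>";
        System.out.println(linksBySection(html,
                Arrays.asList("<h2>Computers</h2>", "<h2>Notebooks</h2>")));
    }
}
```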

Good luck!
__________________
It takes 2 points to draw a straight line, but at least 3 points to draw a conclusion.
All times are GMT -8.