Sunday, May 20, 2012
Crawl a Site after Login
Introduction
This post shows how to handle cookies and POST requests in order to log in to a site and crawl its private content.
Prerequisite
LoginCrawler is based on SimpleCrawler, so please read that post first:
http://ben-bai.blogspot.com/2012/04/java-simple-web-crawler.html
About the form post
Assume a page contains the following form:
<form>
<input name="userName" />
<input name="passWord" />
</form>
Suppose your user name is 'someone' and your password is '123'. To log in with a POST request,
the parameters are "userName=someone&passWord=123".
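The parameter string above can be built from name/value pairs. The following is only a minimal sketch, not the code used in LoginCrawler: the class name FormBodySketch and the method encodeForm are hypothetical names made up for this illustration. It URL-encodes each name and value, so a password containing characters such as '&' or '=' would still be posted correctly.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of LoginCrawler.
class FormBodySketch {

    /** Joins the fields into an application/x-www-form-urlencoded string. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&'); // separator between name=value pairs
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("userName", "someone");
        fields.put("passWord", "123");
        System.out.println(encodeForm(fields)); // prints userName=someone&passWord=123
    }
}
```

The field names must match the name attributes of the inputs in the form, here userName and passWord.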
Please note that the login flow may differ from site to site; LoginCrawler has only been tested with MediaWiki.
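As a rough illustration of the general flow (post the login form, keep the cookies, then request a private page), here is a sketch under several assumptions. It uses java.net.CookieManager, which requires Java 6 or later, and LoginCrawler itself may handle cookies differently. The class LoginFlowSketch, its method names, and the command-line arguments are all hypothetical. Some sites also expect extra hidden fields (for example a token taken from the login page), which this sketch does not cover.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch for illustration only; not the actual LoginCrawler.
class LoginFlowSketch {

    /** Installs a JVM-wide cookie store so later requests send the session cookie. */
    static CookieManager installCookieManager() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** Posts an already URL-encoded form body and returns the HTTP status code. */
    static int postForm(String url, String formBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are going to write a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        OutputStream out = conn.getOutputStream();
        try {
            out.write(formBody.getBytes("UTF-8"));
        } finally {
            out.close();
        }
        // Reading the status makes the connection parse the response headers,
        // which is when the installed CookieManager stores any Set-Cookie values.
        return conn.getResponseCode();
    }

    /** Fetches a page as text; cookies stored by the login are sent automatically. */
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        InputStream in = conn.getInputStream();
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return buffer.toString("UTF-8");
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: LoginFlowSketch <loginUrl> <formBody> <privateUrl>");
            return;
        }
        installCookieManager();
        System.out.println("login status: " + postForm(args[0], args[1]));
        System.out.println(get(args[2]));
    }
}
```

The cookie manager has to be installed before the login request is sent; otherwise the session cookie returned by the server is dropped and the later request is treated as anonymous.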
The Program
LoginCrawler
https://github.com/benbai123/JSP_Servlet_Practice/blob/master/Practice/JAVA/Net/src/test/LoginCrawler.java
Download
SimpleCrawler
https://github.com/benbai123/JSP_Servlet_Practice/blob/master/Practice/JAVA/Net/src/test/SimpleCrawler.java
LoginCrawler
https://github.com/benbai123/JSP_Servlet_Practice/blob/master/Practice/JAVA/Net/src/test/LoginCrawler.java
Reference
http://docs.oracle.com/javase/1.5.0/docs/guide/deployment/deployment-guide/cookie_support.html