org.niocchi.core
Class Crawler

java.lang.Object
  extended by org.niocchi.core.Crawler

public class Crawler
extends java.lang.Object


Field Summary
 int address_total_time
           
 int connection_total_time
           
 int incomplete_count
           
 int internal_error_count
           
 int processed_count
           
 int read_total_time
           
 int redirected_count
           
 int select_total_time
           
 long start_time
           
 int status_200
           
 int status_other
           
 int timeout_count
           
 int write_total_time
           
 
Constructor Summary
Crawler(ResourceFactoryInt res_factory_, int max_channels_)
          Create a new Crawler instance.
 
Method Summary
 int getConnectionTimeout()
          Return the current connection timeout.
 int getReadTimeout()
          Return the current read timeout.
 RedirectionController getRedirectionController()
          Return the current redirection filter that the crawler is using, or null if there isn't one.
 int getSelectTimeout()
          Returns the current select timeout.
 java.lang.String getUserAgent()
          Returns the user agent.
 void interruptCrawling()
          Interrupts the crawling in a clean and relatively immediate way.
 void printMonitoredState(java.io.PrintStream out_)
          Write some crawl statistics.
 void run(URLPool url_pool_)
          Start the crawl.
 void setAllowCompression(boolean allowCompresion)
          Set content compression on/off.
 void setConnectionTimeout(int timeout_)
          Set the connection timeout.
 void setNegativeResolutionTTL(int ttl_)
           
 void setReadTimeout(int timeout_)
          Set the read (data reception) timeout.
 void setRedirectionController(RedirectionController controller_)
          Sets the new RedirectionController.
 void setSelectTimeout(int timeout_)
          Set the timeout for the selection of ready channels.
 void setTimeout(int timeout_)
          Set the connection timeout and the read (data reception) timeout.
 void setUserAgent(java.lang.String ua_)
          Set the user agent.
 void setVerbose()
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

start_time

public long start_time

processed_count

public int processed_count

status_200

public int status_200

status_other

public int status_other

redirected_count

public int redirected_count

incomplete_count

public int incomplete_count

internal_error_count

public int internal_error_count

connection_total_time

public int connection_total_time

read_total_time

public int read_total_time

write_total_time

public int write_total_time

select_total_time

public int select_total_time

address_total_time

public int address_total_time

timeout_count

public int timeout_count
Constructor Detail

Crawler

public Crawler(ResourceFactoryInt res_factory_,
               int max_channels_)
        throws java.io.IOException
Create a new Crawler instance. The crawler doesn't know the exact class of resource used, so it uses the resource factory to generate the resources that will be associated with the queries. The number of resource instances is bounded by the max_channels_ parameter.

Parameters:
res_factory_ - the resource factory.
max_channels_ - the maximum number of channels in use (and therefore of resources in use).
Throws:
java.io.IOException
Method Detail

setNegativeResolutionTTL

public void setNegativeResolutionTTL(int ttl_)

setVerbose

public void setVerbose()

setUserAgent

public void setUserAgent(java.lang.String ua_)
Set the user agent. By default the crawler doesn't send any user agent.

Parameters:
ua_ - the user agent.

getUserAgent

public java.lang.String getUserAgent()
Returns the user agent. By default the crawler doesn't send any user agent.

Returns:
the user agent.

setAllowCompression

public void setAllowCompression(boolean allowCompresion)
Set content compression on/off. If compression is on, servers that support it can send the content compressed.

Parameters:
allowCompresion - true to allow compressed content, false to disable it.
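
As a rough illustration of the user agent and compression settings, the sketch below builds the request headers that these two options would plausibly affect. It is not niocchi code: RequestHeaderSketch and build are hypothetical names, and advertising gzip through the Accept-Encoding header is an assumption about how the compression is negotiated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of niocchi.
final class RequestHeaderSketch {
    // Returns the optional headers implied by the two settings.
    static Map<String, String> build(String userAgent, boolean allowCompression) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (userAgent != null) {
            // No user agent configured means no User-Agent header at all.
            headers.put("User-Agent", userAgent);
        }
        if (allowCompression) {
            // Assumption: compression support is advertised as gzip.
            headers.put("Accept-Encoding", "gzip");
        }
        return headers;
    }
}
```

With no user agent and compression off, the map is empty, which matches the documented default of sending no user agent.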

run

public void run(URLPool url_pool_)
         throws java.io.IOException
Start the crawl.

Parameters:
url_pool_ - the pool of URLs to crawl.
Throws:
java.io.IOException

getRedirectionController

public RedirectionController getRedirectionController()
Return the current redirection filter that the crawler is using, or null if there isn't one. If you call this method right after instantiating the Crawler, you'll get an object of the class HostRedirectionController.
You can use the configuration methods of HostRedirectionController to configure the crawler's redirections, or you can implement your own RedirectionController and install it with setRedirectionController(RedirectionController). It is recommended that your own policy include the old one, by calling RedirectionController#filter(Query, URL) on the old controller before running your own filtering code.

See Also:
setRedirectionController(RedirectionController)
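
The recommendation to consult the old policy first can be sketched as a simple wrapper. The types below are stand-ins so the example compiles on its own: RedirectPolicySketch is a hypothetical substitute for RedirectionController, its parameters replace (Query, URL) with a String and a URI, and the boolean return value ("follow this redirection?") is an assumption, since this page does not show the return type of filter.

```java
import java.net.URI;

// Hypothetical stand-in for RedirectionController; the boolean return
// type and the (String, URI) parameters are assumptions.
interface RedirectPolicySketch {
    boolean filter(String originalUrl, URI target);
}

// Consults the previous policy first, then applies one extra rule.
final class ChainedRedirectPolicy implements RedirectPolicySketch {
    private final RedirectPolicySketch previous; // may be null
    private final String requiredScheme;

    ChainedRedirectPolicy(RedirectPolicySketch previous, String requiredScheme) {
        this.previous = previous;
        this.requiredScheme = requiredScheme;
    }

    @Override
    public boolean filter(String originalUrl, URI target) {
        // A redirection rejected by the old policy stays rejected.
        if (previous != null && !previous.filter(originalUrl, target)) {
            return false;
        }
        // Extra rule of this sketch: only follow targets with the given scheme.
        return requiredScheme.equalsIgnoreCase(target.getScheme());
    }
}
```

With the real API, the previous policy would be the object returned by getRedirectionController(), and the wrapper would be passed to setRedirectionController(RedirectionController).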

setRedirectionController

public void setRedirectionController(RedirectionController controller_)
Sets the new RedirectionController. The RedirectionController is used by the crawler to decide whether or not to follow a redirection. If you want your own redirection policy, implement your own RedirectionController.

See Also:
getRedirectionController()

interruptCrawling

public void interruptCrawling()
Interrupts the crawling in a clean and relatively immediate way. Queries that are currently being crawled will be finished before stopping.
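
The "finish what is in flight, then stop" behaviour is a common cooperative-stop pattern. The sketch below shows that general pattern only; it is not niocchi's implementation, and InterruptibleLoopSketch, its methods, and the interruptWhileOn parameter are hypothetical names used to make the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of a cooperative stop flag.
final class InterruptibleLoopSketch {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final List<String> finished = new ArrayList<>();

    // Plays the role of interruptCrawling(): only raises a flag.
    void interrupt() {
        stopRequested.set(true);
    }

    // Processes items until the queue is empty or a stop was requested.
    List<String> run(Queue<String> pending, String interruptWhileOn) {
        while (!stopRequested.get() && !pending.isEmpty()) {
            String item = pending.poll();
            if (item.equals(interruptWhileOn)) {
                interrupt(); // simulates the request arriving mid-item
            }
            finished.add(item); // the in-flight item still completes
        }
        return finished;
    }
}
```

The flag is checked only between items, so the item being processed when the request arrives is completed and the remaining items are left in the queue.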


setTimeout

public void setTimeout(int timeout_)
Set the connection timeout and the read (data reception) timeout. Default = 1000ms.

Parameters:
timeout_ - the time in milliseconds.

setSelectTimeout

public void setSelectTimeout(int timeout_)
Set the timeout for the selection of ready channels. Default = 1000ms.

Parameters:
timeout_ - the time in milliseconds.
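
For readers unfamiliar with channel selection, the sketch below shows what a select timeout means in plain java.nio. It assumes that the crawler's select timeout corresponds to the argument of Selector.select(long), which this page does not state; SelectTimeoutSketch and selectOnce are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;

// Hypothetical illustration using only the standard java.nio API.
final class SelectTimeoutSketch {
    // Waits up to timeoutMillis for ready channels and returns how many
    // keys became ready. With java.nio, a timeout of 0 blocks indefinitely.
    static int selectOnce(long timeoutMillis) {
        try (Selector selector = Selector.open()) {
            // No channel is registered here, so nothing can become ready
            // and the call returns 0 once the timeout elapses.
            return selector.select(timeoutMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A shorter timeout lets the selecting loop wake up more often to do other work, at the cost of more wake-ups when no channel is ready.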

setConnectionTimeout

public void setConnectionTimeout(int timeout_)
Set the connection timeout. Default = 1000ms.

Parameters:
timeout_ - the time in milliseconds.

setReadTimeout

public void setReadTimeout(int timeout_)
Set the read (data reception) timeout. Default = 10000ms.

Parameters:
timeout_ - the time in milliseconds.

getSelectTimeout

public int getSelectTimeout()
Returns the current select timeout.

Returns:
the time in milliseconds.

getConnectionTimeout

public int getConnectionTimeout()
Return the current connection timeout.

Returns:
the time in milliseconds.

getReadTimeout

public int getReadTimeout()
Return the current read timeout.

Returns:
the time in milliseconds.

printMonitoredState

public void printMonitoredState(java.io.PrintStream out_)
Write some crawl statistics.

Parameters:
out_ - the stream to write the statistics to.
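
Because the counters are public fields, derived figures can also be computed by hand. The sketch below uses a stand-in holder so that it compiles without niocchi: CrawlStatsSketch and its two methods are hypothetical, the field names are copied from the Field Summary, and treating connection_total_time as a sum over processed queries (in whatever time unit the crawler records) is an assumption.

```java
// Hypothetical stand-in holding three of the documented counters.
final class CrawlStatsSketch {
    int processed_count;
    int status_200;
    int connection_total_time;

    // Fraction of processed queries that ended with HTTP status 200.
    double successRatio() {
        return processed_count == 0 ? 0.0 : (double) status_200 / processed_count;
    }

    // Mean connection time per processed query (assumed interpretation).
    double averageConnectionTime() {
        return processed_count == 0 ? 0.0 : (double) connection_total_time / processed_count;
    }
}
```

Both methods return 0.0 before any query has been processed, which avoids a division by zero on a crawler that has not started.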