Niocchi
Niocchi is a Java crawler library implementing synchronous I/O multiplexing.
This specific type of implementation allows crawling tens of thousands of hosts in parallel on a single low-end server. Niocchi has been designed for big search engines that need to crawl massive amounts of data, but it can also be used to write no-frills crawlers. It is currently used in production by Enormo and Vitalprix.
Index
- Introduction
- Requirements
- License
- Package organization
- Architecture
- Usage
- Caveats
- To Do
- Download
- Change history
- About the authors

Introduction

Most Java crawling libraries use the standard Java I/O package. That means crawling N documents in parallel requires at least N running threads. Even if each thread consumes few resources while fetching content, this approach becomes costly when crawling at a large scale. By contrast, synchronous I/O multiplexing with the NIO package, introduced in Java 1.4, allows many documents to be crawled in parallel using a single thread.
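
To make the contrast concrete, here is a minimal sketch of the general selector pattern using the plain JDK NIO classes (not Niocchi's internals): one thread registers several non-blocking sockets with a single Selector and reacts to whichever becomes ready. Note that the InetSocketAddress constructor resolves the host name and blocks, which is why Niocchi delegates resolution to separate threads, as described in the Architecture section.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class SelectorSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            // One thread, many connections: register each non-blocking socket.
            for (String host : new String[] { "example.com", "example.org" }) {
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                // Caveat: this constructor resolves the host name and blocks.
                ch.connect(new InetSocketAddress(host, 80));
                ch.register(selector, SelectionKey.OP_CONNECT);
            }
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (!selector.keys().isEmpty()) {
                selector.select(1000); // wait up to 1s for readiness events
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    if (key.isConnectable() && ch.finishConnect()) {
                        // Connected: send a minimal HTTP request, then wait for data.
                        ch.write(ByteBuffer.wrap("GET / HTTP/1.0\r\n\r\n".getBytes("US-ASCII")));
                        key.interestOps(SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buf.clear();
                        if (ch.read(buf) == -1) { // EOF: this download is done
                            key.cancel();
                            ch.close();
                        }
                        // A real crawler would hand the buffer to a parser here.
                    }
                }
            }
        }
    }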

Requirements

Niocchi requires Java 1.5 or above.

License

This software is licensed under the Apache License, Version 2.0.

Package organization

- org.niocchi.core holds the library itself.
- org.niocchi.gc holds an implementation example of a very simple crawler that reads the URLs to crawl from a file and saves the crawled documents.
- org.niocchi.monitor holds a utility thread that the crawler can use to provide real-time information through a telnet connection.
- org.niocchi.rc holds an implementation example of a RedirectionController.
- org.niocchi.resources holds a few implementation examples of the Resource and ResourceCreator classes.
- org.niocchi.urlpools holds a few implementation examples of the URLPool class.

Architecture

- A Query encapsulates a URL and implements methods to check its status after being crawled.
- A Resource holds the crawled content and implements methods to
save it.
- Each Query is associated with a Resource. To crawl a URL, a Resource is taken from the pool of resources. Once the URL is crawled and its content processed, the Resource is returned to the pool. The number of available Resources is fixed and controls how many URLs can be crawled in parallel at any time. This number is set through the ResourcePool constructor.
- When a Query is crawled, its associated Resource will be
processed by one of the workers.
- The URLPool acts as the source of URLs into which the crawler taps. It is an interface that must be implemented to provide URLs to the crawler.
- The crawler has been designed as "active", meaning it consumes URLs from the URLPool, as opposed to being "passive" and waiting to be given URLs. When the crawler starts, it gets URLs to crawl from the URLPool until all Resources are consumed, hasNextQuery() returns false, or getNextQuery() returns null. Each time a Query is crawled and processed and its Resource returned to the ResourcePool, the crawler requests more URLs from the URLPool under the same conditions. If all URLs have been crawled and no more are immediately available, the crawler rechecks every second for available URLs to crawl.
- When a Query has been crawled, it is put into a FIFO of queries to be processed. One of the Workers takes it and processes the content of its associated Resource; the work is done in the processResource() method. The Query is then returned to the URLPool, which can examine the crawl status and the result of the processing. Lastly, the Query's associated Resource is returned to the ResourcePool.
- In order not to block during host name resolution, the Crawler uses two additional threads: ResolverQueue resolves the URLs coming from the URLPool, and RedirectionResolverQueue resolves the URLs obtained from redirections.
This architecture is represented in a diagram in the project documentation.
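
To make the lifecycle concrete, here is a hypothetical wiring sketch. The class names come from the description above, but every constructor signature and the start call are assumptions; consult the javadoc and the org.niocchi.gc example for the real API.

    // Hypothetical wiring; all signatures below are assumed, not documented.
    URLPool urlPool = new QueueURLPool(seedUrls);               // your URLPool implementation (see Usage)
    ResourcePool resourcePool = new ResourcePool(creator, 500); // 500 = max URLs crawled in parallel (constructor assumed)
    Crawler crawler = new Crawler(urlPool, resourcePool);       // constructor assumed
    crawler.run();                                              // runs until hasNextQuery() returns false (assumed)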

Usage

In order to use Niocchi, the following classes and methods must be implemented:
Worker
Subclass Worker and implement processResource(Query). This is where you do whatever needs to be done with the crawled content. Check the DiskSaveWorker class for an example implementation.
You will usually instantiate one worker per CPU core.
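
A minimal sketch, assuming a Worker is constructed around the Crawler it serves; only the processResource(Query) hook is documented here, and the constructor wiring and query accessor are assumptions:

    // Sketch only: the constructor wiring and getURL() are assumptions.
    public class PrintingWorker extends Worker {
        public PrintingWorker(Crawler crawler) {
            super(crawler); // constructor signature assumed
        }

        @Override
        public void processResource(Query query) {
            // Inspect the crawled content held by the Query's associated
            // Resource here: parse it, extract links, store it, etc.
            System.out.println("Crawled: " + query.getURL()); // accessor assumed
        }
    }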
URLPool
Implement the URLPool interface.
- getNextQuery() returns a Query to crawl, or null if no Query is available yet.
- hasNextQuery() returns false only when there are no more queries to crawl and the crawler must terminate after the last queries still being crawled are processed.
- setProcessed(Query) is called by the crawler to inform the URLPool that a Query has been crawled and its Resource processed. This is typically where the URLPool checks the crawl status, logs the error in case of failure, or extracts more URLs to crawl in case of success.
A typical case where getNextQuery() returns null while hasNextQuery() returns true is when the URLPool is waiting for in-flight queries to come back from processing, since their content may yield more URLs to crawl.
Check the urlpools package for example implementations.
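
As a rough illustration of the contract, here is a queue-backed sketch; the three methods are the documented ones, while the Query constructor is an assumption. It also shows the null/true case described above:

    import java.util.LinkedList;
    import java.util.Queue;

    // Sketch: only getNextQuery(), hasNextQuery() and setProcessed(Query)
    // are documented; the Query constructor is assumed.
    public class QueueURLPool implements URLPool {
        protected final Queue<String> pending = new LinkedList<String>();
        protected int inFlight = 0;

        public QueueURLPool(Iterable<String> seeds) {
            for (String url : seeds) pending.add(url);
        }

        public synchronized Query getNextQuery() {
            String url = pending.poll();
            if (url == null) return null; // nothing ready right now
            inFlight++;
            return new Query(url); // constructor assumed
        }

        public synchronized boolean hasNextQuery() {
            // Stay alive while URLs are queued or still in flight: queries
            // being processed may feed new URLs back through setProcessed().
            return !pending.isEmpty() || inFlight > 0;
        }

        public synchronized void setProcessed(Query query) {
            inFlight--;
            // Check the crawl status here and enqueue any URLs extracted
            // from the processed content, e.g. pending.add(newUrl).
        }
    }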
Optionally, the following classes may be implemented:
A Resource and a ResourceCreator
If the URLPool must check the validity of the crawled content, subclass these classes and implement Resource.isValid() and ResourceCreator.createResource().
A Query
If you need to pass additional information to the Worker or the URLPool, subclass Query.
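
Minimal sketches of both optional subclasses. Only isValid() and createResource() are documented; the MemoryResource base class (named in the 1.1 release notes), the default constructors, and the super(url) call are assumptions:

    // Sketch: reject content the URLPool should not trust. The content
    // accessor depends on the Niocchi version, so the check stays schematic.
    public class CheckedResource extends MemoryResource {
        @Override
        public boolean isValid() {
            return true; // e.g. verify length or a closing </html> tag here
        }
    }

    public class CheckedResourceCreator extends ResourceCreator {
        @Override
        public Resource createResource() {
            return new CheckedResource();
        }
    }

    // Sketch: carry extra per-URL state, here a crawl depth.
    public class DepthQuery extends Query {
        public final int depth;

        public DepthQuery(String url, int depth) {
            super(url); // constructor signature assumed
            this.depth = depth;
        }
    }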

Caveats

- There is no politeness mechanism implemented. You have to implement your own mechanism in the URLPool; see the sketch after this list.
- Each socket consumes one file descriptor. If you intend to crawl a large number of documents in parallel and hit the system's default limit, you can raise it with ulimit (e.g. ulimit -n).
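
For the politeness caveat, here is one possible policy layered on the QueueURLPool sketch from the Usage section: a URL is served only if its host has not been contacted within a minimum delay. The delay value and, as before, the Query constructor are assumptions:

    import java.net.URI;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Sketch: per-host minimum delay enforced inside getNextQuery().
    public class PoliteURLPool extends QueueURLPool {
        private static final long DELAY_MS = 2000; // minimum gap per host (assumed value)
        private final Map<String, Long> lastHit = new HashMap<String, Long>();

        public PoliteURLPool(Iterable<String> seeds) {
            super(seeds);
        }

        @Override
        public synchronized Query getNextQuery() {
            long now = System.currentTimeMillis();
            // Scan pending URLs for one whose host is outside its delay window.
            for (Iterator<String> it = pending.iterator(); it.hasNext(); ) {
                String url = it.next();
                String host = URI.create(url).getHost();
                Long last = lastHit.get(host);
                if (last == null || now - last >= DELAY_MS) {
                    it.remove();
                    lastHit.put(host, now);
                    inFlight++;
                    return new Query(url); // constructor assumed
                }
            }
            return null; // nothing polite to serve; the crawler rechecks every second
        }
    }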

To Do

- Complete the documentation.
- Implement the DNS protocol and do the resolution through the select loop, in order to resolve in parallel and get rid of the two resolver threads.

Download

- Niocchi 1.1
- Niocchi 1.0

Change history

2011-07-10 V1.1
Warning: some changes make this version not backward compatible, though migrating should require only minor changes to your code.
New features:
- Disk storage is implemented. You can choose to have the crawler store resource content directly on disk instead of in memory. The DiskResource class implements this behavior; MemoryResource implements the old behavior of keeping the content in memory.
- Split the configurable 'timeout' value into 'select timeout', 'connection timeout' and 'read timeout'. Previously, each channel was checked for a read or connection timeout only after a select timeout was triggered, which happens when none of the registered channels becomes ready for a network operation before the timeout delay expires. As a consequence, channels could remain uselessly blocked, instead of being recycled, until all registered channels timed out.
- Introduced a new Query status: CONNECTION_ERROR. Previously, UNREACHABLE was used both for connection errors and for redirection issues (redirection limit reached, redirection not allowed, etc.).
- The Resource class implements an isValid() method that always returns true.
2010-11-17 V1.0
Original release.

About the authors

Niocchi was written by François-Louis Mommens, with contributions from Iván de Prado Alonso and Marc Gracia.
François-Louis Mommens is also the co-author, with Tom Dibaja, of Linkody, a free online SEO tool that monitors your SEO backlinks 24/7 and sends you notifications when any link disappears or is changed.