Niocchi
Niocchi is a Java asynchronous crawl library implemented with NIO. It is designed to crawl several thousand hosts in parallel on a single low-end server.
It is currently used in production by Enormo to crawl thousands of websites daily, and by Vitalprix.

Index

  1. Introduction
  2. Requirements
  3. License
  4. Package organization
  5. Architecture
  6. Usage
  7. Caveats
  8. To Do
  9. Download
  10. About the Authors

Introduction

Most Java crawling libraries use standard synchronous Java I/O, so crawling N documents in parallel requires at least N running threads. Even if each thread consumes few resources while fetching content, that approach becomes costly when crawling at a large scale. By contrast, asynchronous I/O with the NIO package introduced in Java 1.4 allows many documents to be crawled in parallel from a single thread.
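
The single-thread idea can be illustrated with a plain java.nio Selector. The sketch below is not Niocchi code: the class name SelectorSketch and the method drain are hypothetical, and in-memory pipes stand in for the sockets a crawler would open, so that the example runs without network access. It assumes Java 8 or later for the lambda syntax.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of Niocchi.
public class SelectorSketch {

    /** Reads n in-memory pipes to end-of-stream using one thread and one Selector. */
    static Map<Integer, String> drain(int n) throws IOException {
        Map<Integer, String> results = new TreeMap<>();
        try (Selector selector = Selector.open()) {
            Pipe[] pipes = new Pipe[n];
            for (int i = 0; i < n; i++) {
                pipes[i] = Pipe.open();
                Pipe.SourceChannel source = pipes[i].source();
                source.configureBlocking(false);
                // The attachment identifies the channel when its key is selected.
                source.register(selector, SelectionKey.OP_READ, i);
                results.put(i, "");
            }

            // Simulate the remote hosts: each writes a short payload, then closes.
            for (int i = 0; i < n; i++) {
                try (Pipe.SinkChannel sink = pipes[i].sink()) {
                    ByteBuffer out = ByteBuffer.wrap(
                            ("payload-" + i).getBytes(StandardCharsets.UTF_8));
                    while (out.hasRemaining()) {
                        sink.write(out);
                    }
                }
            }

            ByteBuffer buffer = ByteBuffer.allocate(1024);
            int open = n;
            while (open > 0) {
                selector.select(); // blocks until at least one channel is readable
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    ReadableByteChannel channel = (ReadableByteChannel) key.channel();
                    int id = (Integer) key.attachment();
                    buffer.clear();
                    int read = channel.read(buffer);
                    if (read < 0) {
                        // End of stream: stop watching this channel.
                        key.cancel();
                        channel.close();
                        open--;
                    } else {
                        buffer.flip();
                        String chunk = StandardCharsets.UTF_8.decode(buffer).toString();
                        results.merge(id, chunk, String::concat);
                    }
                }
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        drain(3).forEach((id, body) -> System.out.println(id + " -> " + body));
    }
}
```

All channels are serviced from the one loop around selector.select(); adding more channels adds registrations, not threads.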

Requirements

Niocchi requires Java 1.5 or above.

License

This software is licensed under the Apache License, Version 2.0.

Package organization

Architecture

This architecture is represented in the following diagram:

Usage

To use Niocchi, implement the following interface and abstract classes.

Resource and ResourceCreator

Subclass these classes and implement Resource.isValid() and ResourceCreator.createResource(), or use one of the two provided implementations for HTML and image resources.

Worker

Subclass Worker and implement processResource(Query). This is where you do whatever needs to be done with the crawled content. See the DiskSaveWorker class for an example implementation.
You will usually instantiate one worker per CPU core.
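
The one-worker-per-core rule can be sketched with JDK classes alone. This is a stand-in, not Niocchi's Worker API: the class WorkerPoolSketch and the method processAll are hypothetical, and the per-document task merely counts bytes where a real worker would do its processing. It assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not part of Niocchi.
public class WorkerPoolSketch {

    /** Processes already-fetched documents on one thread per CPU core; returns total bytes handled. */
    static long processAll(List<byte[]> documents) throws InterruptedException {
        // One worker thread per core reported by the JVM.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicLong totalBytes = new AtomicLong();

        for (byte[] doc : documents) {
            // Stand-in for the CPU-bound work a worker would do on crawled content.
            pool.execute(() -> totalBytes.addAndGet(doc.length));
        }

        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return totalBytes.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<byte[]> docs = Arrays.asList(new byte[10], new byte[20], new byte[30]);
        System.out.println(processAll(docs)); // prints 60
    }
}
```

Fetching is I/O-bound and can stay on a single thread, while processing is typically CPU-bound, which is why the number of cores is a sensible default for the worker count.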

URLPool

Implement the URLPool interface.

Caveats

To Do

Download

Niocchi 1.0
Apache Commons Logging

About the Authors

Niocchi was written by François-Louis Mommens and has received contributions from Iván de Prado Alonso and Marc Gracia.