Niocchi

Niocchi is a Java crawler library implementing synchronous I/O multiplexing. This type of implementation allows crawling tens of thousands of hosts in parallel on a single low-end server. Niocchi was designed for large search engines that need to crawl massive amounts of data, but it can also be used to write no-frills crawlers. It is currently used in production by Enormo and Vitalprix.

javadoc

Index

  1. Introduction
  2. Requirements
  3. License
  4. Package organization
  5. Architecture
  6. Usage
  7. Caveats
  8. To Do
  9. Download
  10. Change history
  11. About the authors

Introduction

Most Java crawling libraries use the standard Java IO package. That means crawling N documents in parallel requires at least N running threads. Even if each thread uses few resources while fetching content, that approach becomes costly when crawling at a large scale. In contrast, synchronous I/O multiplexing with the NIO package introduced in Java 1.4 allows many documents to be crawled in parallel from a single thread.
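The single-thread approach described above can be illustrated with a minimal sketch that uses only the standard java.nio classes. It is not Niocchi's own code: the class name SelectorSketch, the method fetchAll and the tiny loopback server that answers "hello" are assumptions made for this example, and it is written for Java 8 or later. One Selector watches every socket, and the calling thread only touches a socket when it is ready.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not part of Niocchi.
final class SelectorSketch {

    /** Opens the given number of client connections and reads all replies on the calling thread. */
    static List<String> fetchAll(int connections) {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            // Local stand-in for remote hosts: a loopback server on an ephemeral port.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), connections);
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();

            // Start every connection without waiting for any of them to complete.
            for (int i = 0; i < connections; i++) {
                SocketChannel client = SocketChannel.open();
                client.configureBlocking(false);
                boolean connected = client.connect(address);
                int ops = connected ? SelectionKey.OP_READ : SelectionKey.OP_CONNECT;
                client.register(selector, ops, new StringBuilder());
            }

            List<String> replies = new ArrayList<>();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            while (replies.size() < connections) {
                selector.select(); // blocks until at least one channel is ready
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {
                        // Server side of the demo: send a short reply, then close.
                        SocketChannel peer = server.accept();
                        if (peer != null) {
                            peer.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.US_ASCII)));
                            peer.close();
                        }
                    } else if (key.isConnectable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        if (client.finishConnect()) {
                            key.interestOps(SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        StringBuilder body = (StringBuilder) key.attachment();
                        buffer.clear();
                        int read = client.read(buffer);
                        if (read < 0) {
                            // End of stream: this document is complete.
                            replies.add(body.toString());
                            client.close();
                        } else {
                            buffer.flip();
                            body.append(StandardCharsets.US_ASCII.decode(buffer));
                        }
                    }
                }
            }
            return replies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling SelectorSketch.fetchAll(5) opens five connections and collects five replies without starting any additional thread. A real crawler would also need timeouts, HTTP handling and per-host politeness, which this sketch leaves out.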

Requirements

Niocchi requires Java 1.5 or above.

License

This software is licensed under the Apache License, Version 2.0.

Package organization

Architecture

This architecture is represented in the following diagram:

Usage

In order to use Niocchi, the following classes and methods must be implemented:

Worker

Subclass Worker and implement processResource(Query). This is where you do whatever needs to be done with the crawled content. See the DiskSaveWorker class for an example implementation.
You will usually instantiate one worker per CPU core.
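The one-worker-per-core rule of thumb can be sketched with the standard java.util.concurrent classes. The example below does not use Niocchi's Worker class: WorkerPoolSketch and processAll are hypothetical names, and adding up document lengths is only a stand-in for whatever processResource(Query) would do in a real crawler. It assumes Java 8 or later.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of Niocchi.
final class WorkerPoolSketch {

    /** Processes every document on a pool sized to the number of CPU cores. */
    static int processAll(List<String> documents) {
        // One worker thread per core reported by the JVM.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger totalLength = new AtomicInteger();

        for (String document : documents) {
            // Stand-in for the real per-document processing.
            pool.execute(() -> totalLength.addAndGet(document.length()));
        }

        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return totalLength.get();
    }
}
```

Sizing the pool to the core count suits CPU-bound processing such as parsing. If the processing step blocks on disk or network, a larger pool may work better.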

URLPool

Implement the URLPool interface.

Optionally, the following classes may be implemented:

A Resource and a ResourceCreator

If the URLPool must check the validity of the crawled content, subclass both classes and implement Resource.isValid() and ResourceCreator.createResource().

A Query

If you need to pass additional information to the Worker or the URLPool, subclass Query.

Caveats

To Do

Download

Niocchi 1.1
Niocchi 1.0

Change history

2011-07-10 V1.1

Warning: some changes make this version incompatible with version 1.0, though migrating should require only minor modifications to your code.

New features:

2010-11-17 V1.0

Original release.

About the authors

Niocchi was written by François-Louis Mommens, with contributions from Iván de Prado Alonso and Marc Gracia.
François-Louis Mommens is also the co-author with Tom Dibaja of Linkody, a free online SEO tool that monitors your SEO backlinks 24/7 and sends you notifications when any link disappears or is changed.