Hedera

A Hadoop toolkit for processing large versioned document collections

Hedera is a software toolkit based on Hadoop framework that supports users in processing, indexing and mining text from very large versioned document collections. Hedera is designed to target rapid development and experimentation on scenarios, where it is required to deliver the (often preliminary) results as soon as possible, before going for a full-fledged solution. Hedera aims to support processing popular big datasets in public, such as Wikipedia revision history, Internet archive, etc. Hedera offers a few features such as:

While there are several frameworks built to support processing big textual data (both in MapReduce and non-MapReduce fashion), little has been focused on efficient processing of versioned document collections such as Web archives, revisions of collaborated documents (news articles or encyclopedia pages such as Wikipedia). As compared to traditional corpora, versioned documents have some following special characteristics:

Hedera was built with those questions in mind. It uses Hadoop frameworks to support the scalable and parallel processing of data in high level programming languages. It optimizes the operations and APIs to address the above challenges, while still conforms to the MapReduce standards and support traditional workflows. Hedera aims to support rapid development of experimental models, and to this extent it tries to rely less on heavy general-purpose frameworks built for enterprise environments (such as Impala, etc.)