Hedera
Hedera is a framework for rapidly developing processing methods on big versioned document collections using Hadoop. It provides customized InputFormat implementations for accessing data in parallel in standard MapReduce workflows. Hedera can be used with Hadoop Streaming to enable rapid development in different languages (Java, Python, C, etc.), and it also supports Pig through a number of user-defined functions (UDFs). At the moment, the framework has been tested on Hadoop CDH 4.x and Pig 0.11.x. It supports both Hadoop YARN and non-YARN models. It is free for research and educational purposes under the GNU and Creative Commons licenses.
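As a rough illustration of the Hadoop Streaming route, the sketch below shows what a Python mapper over revision records might look like. The record layout (one line per revision, `page_id<TAB>revision_id<TAB>text`) and the function name `map_revisions` are assumptions made for this example only; they are not taken from Hedera's documentation, so the parsing would need to be adapted to whatever the chosen InputFormat actually emits.

```python
import sys
from typing import Iterable, Iterator, Tuple


def map_revisions(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (page_id, text_length) for each well-formed input record.

    ASSUMPTION: each record is one line of the form
    "page_id<TAB>revision_id<TAB>text". This layout is hypothetical and
    is not Hedera's documented output format.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed records instead of failing the task
        page_id, _revision_id, text = parts
        yield page_id, len(text)


def main() -> None:
    for page_id, length in map_revisions(sys.stdin):
        # Hadoop Streaming treats the text before the first tab as the key.
        sys.stdout.write(f"{page_id}\t{length}\n")


# Only consume stdin when input is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

A reducer could then aggregate the emitted lengths per page. Keeping the record logic in `map_revisions` separate from the stdin/stdout handling makes the mapper easy to test locally before submitting a job.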
Getting started
- Instructions for working with the Wikipedia revision history dataset are here
Architecture
(documentation is being progressively updated here.)
License
Hedera is available under the Apache License v2.0 and the Creative Commons Attribution v3.0 license.
References
Hedera uses the following libraries for working with Wikipedia:
- Cloud9: Hedera uses the WikipediaPage input formats and Writable implementations from the Cloud9 tools.
- Tuan4j: Tuan4j-distributed is a set of utilities for handling customized Java objects and setting up jobs in a Hadoop environment. It also supports bz2 compression and decompression when reading mapper input streams.
- PigNLPRoc: PigNLPRoc is an excellent tool for working with Wikipedia article dumps in Pig. Hedera inherits the parsing code from PigNLPRoc and provides the same functionality for Wikipedia revision dumps.
- WikiHadoop: WikiHadoop provides a fast and simple way to split XML files in the Wikipedia revision history using SAX-like parsing. Unfortunately, it is poorly documented, does not scale well with big compressed files, and does not provide fast, approximate reading. Hedera adopts WikiHadoop's parsing idea in its transformation phase (see architecture and API for more details), but does so in a fully MapReduce fashion, which ensures scalability when handling big compressed files.
Copyright and Disclaimer
The tool is currently under active development, meaning that it may go through several updates per day, with functionality added, modified, or improved. I am working to make it more stable every day, but if you find any issues with installation or running the code, please send me a message at: ttran (at) L3S (dot) de.