Hedera
A Hadoop toolkit for processing large versioned document collections
Hedera and the Wikipedia Revision History
Wikipedia is perhaps one of the most popular datasets in the text mining research community. It has several unique features that distinguish it from other datasets: it contains rich information compiled by humans, it covers a wide range of topics, from scientific knowledge to celebrity biographies, and it is free. Compared to the Wikipedia text, the Wikipedia revision history covers a much broader spectrum of data: articles, talks, discussions, etc. It also contains the complete set of snapshots of each Wikipedia page, making it a gold mine for studies concerned with the dynamics of Wikipedia content over time.
Unfortunately, while there are several tools that support processing the Wikipedia text dumps in centralized and distributed settings (for example, Cloud9 is an excellent tool based on the Hadoop framework), access to the Wikipedia revision history remains prohibitive because of its enormous volume (tens of terabytes of text). The revision history datasets are also formatted differently from the text dumps and are split into several compressed files, which makes tools such as Cloud9 difficult to reuse.
Other software tools, such as the JWPL RevisionMachine, make a considerable effort to repack the Wikipedia revision history by storing only the differences between consecutive revisions, which saves a lot of disk space (for instance, the dump of 2 May 2014 amounts to only 380 GB). While in many cases this is sufficient to extract basic structures from the dataset, such as the link network, it cannot serve text-intensive queries (for instance, full-text search over a sequence of revisions of different articles) without reconstructing the text from the earliest revisions, which creates a huge bottleneck, if it does not kill the database server outright. Moreover, JWPL is not easy to extend with new functionality, even though its authors make quite an effort to provide as rich a family of APIs as possible. It is thus impossible to extract ad-hoc information from the Wikipedia revisions (for example, all anchor texts of all revisions belonging to certain categories). Finally, JWPL runs in a centralized setting and requires a very long preprocessing phase (one month to index the dataset) before the first experiment or test code can be run. In several research scenarios, this is unacceptable.
And that is where Hedera comes in.
With Hedera, we try to make the Wikipedia revision history dataset accessible to users with different backgrounds and demands. Instead of providing a full-fledged (and thus cumbersome, hard-to-extend) indexing and storage architecture, we built Hedera as an incremental processing tool for Wikipedia revisions. Users can quickly process a subset of the revisions (for example, a few dump files instead of all of them, or only the articles) and get first results without waiting too long. They can also extend the system in their programming language of choice (at the moment, Hedera supports Java, Pig Latin and Python) to extract custom data without much effort. This page provides brief guidance on using Hedera to handle the Wikipedia revision history dataset, starting from the raw files.
Getting started
1. Download Dataset
To get started, download the XML dumps from Wikipedia here. The revision history files are named enwiki-[DATE]-pages-meta-history[NO].xml-pXXXXX.*
, where DATE is the version of the dump (dumps are delivered at different times) and NO is the index of the file. You can work with one, several, or all of the files, even from different dates (if that makes sense for your use case). Put them all in the same HDFS directory.
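If you prefer to stage the files from code rather than with the HDFS shell, here is a minimal sketch using the standard Hadoop FileSystem API; it is not Hedera-specific, and the local and HDFS paths are placeholders you need to adapt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadDumps {
  public static void main(String[] args) throws Exception {
    // Connect to the default file system configured for your cluster
    FileSystem fs = FileSystem.get(new Configuration());
    // Copy the local directory of downloaded dump files into one HDFS directory
    fs.copyFromLocalFile(new Path("/local/path/to/dumps"),
                         new Path("/user/you/wiki-revisions"));
    fs.close();
  }
}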
2. Build the Project
Next, obtain the source code from GitHub and compile it (the commands below assume you are familiar with the Maven build tool):
git clone https://github.com/antoine-tran/Hedera
cd Hedera
mvn install -DskipTests
This will create a jar file hedera-XXX.jar
and copy all of its third-party dependencies to the directory named "libhedera" in the root directory of the project. If your cluster runs a different version of Hadoop or Pig, replace the jars in this directory (or change the versions accordingly in the POM file). Alternatively, you can package all libraries and the compiled code into one big jar with the shade:shade
plugin (details here).
3. Input Format
Before moving on to the next steps, you should get familiar with Hedera's input formats. If you have not worked much with Hadoop input formats, you can read here for a conceptual explanation. Hedera currently supports the following input formats (a minimal job setup example follows the table):
Input format | Interface (key,value) | Description |
---|---|---|
WikiRevisionTextInputFormat | (LongWritable, Text) | Transforms each revision of a Wikipedia page into a pair with the revision id as key and the textual content as value |
WikiRevisionPageInputFormat | (LongWritable, Revision) | Transforms each revision of a Wikipedia page into a pair with the revision id as key and the textual content, wrapped in the Java object Revision, as value (see Javadoc API) |
WikiRevisionPairInputFormat | (LongWritable, Text) | Transforms each two consecutive revisions of a Wikipedia page into a pair with the page id as key and the textual content of the two revisions as value (see API for the XML format) |
WikiRevisionDiffInputFormat | (LongWritable, RevisionDiff) | Transforms each two consecutive revisions of a Wikipedia page into a pair with the page id as key and the differences between the textual contents of the revisions as value. The differences are computed by Myers' algorithm and output as a Java RevisionDiff object (see Javadoc API) |
RevisionLinkInputFormat | (LongWritable, LinkProfile) | Transforms each revision of a Wikipedia page into a pair with the revision id as key and the list of outgoing links to other Wikipedia pages, as they appear in the revision text, as value. The results are output as a Java LinkProfile object (see Javadoc API) |
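As a minimal sketch of how these input formats plug into a Hadoop job, the driver below wires WikiRevisionTextInputFormat into a vanilla MapReduce job. The package of the input format is not shown (fill in the import from the Javadoc API), and the identity Mapper is used only as a stand-in for your own mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// import ...WikiRevisionTextInputFormat;  // exact package path: see the Javadoc API

public class RevisionTextJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hedera-revision-text");
    job.setJarByClass(RevisionTextJob.class);
    // Feed (revision id, revision text) pairs to the mapper
    job.setInputFormatClass(WikiRevisionTextInputFormat.class);
    job.setMapperClass(Mapper.class);          // identity mapper; plug in your own
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS dir with the dumps
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}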
Extension
(Note: Documentation in progress)
It is possible to define your own custom input formats. All of the input formats above follow similar data structures and workflow, implemented in the abstract class WikiRevisionInputFormat
. Its core building block is WikiRevisionReader
, which defines two functions, doWhenMatch()
and readUntilMatch()
, that handle the logic when the parser reaches the content between revisions or pages. More details are coming soon; a generic skeleton of this pattern is sketched below.
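Until that documentation lands, the skeleton below shows the standard Hadoop pattern such an input format follows. The class names and method bodies here are illustrative only; the actual signatures of WikiRevisionInputFormat and WikiRevisionReader may differ, so treat the comments mentioning doWhenMatch()/readUntilMatch() as an analogy rather than the real API:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyRevisionInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new MyRevisionReader();
  }

  public static class MyRevisionReader extends RecordReader<LongWritable, Text> {
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      // Open the split and seek to the first page/revision boundary,
      // analogous to Hedera's readUntilMatch().
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      // Scan the stream until the closing tag of the next record is found
      // (readUntilMatch()), then populate key/value (doWhenMatch()).
      return false; // placeholder: this skeleton emits no records
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0f; }
    @Override public void close() throws IOException { }
  }
}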
Filtering
In many cases you do not want to process every Wikipedia page, and instead prefer to quickly filter the raw files before sending the data into MapReduce. Hedera supports a number of filtering options to do so:
Filtering non-article pages
To filter out non-article pages from the raw XML files, just set the configuration variable org.hedera.input.onlyarticle
to true:
In Java:
yourJobConf.setBoolean(WikiRevisionInputFormat.SKIP_NON_ARTICLES, true);
In Pig:
SET 'org.hedera.input.onlyarticle' true
Sampling by time interval
The revision history dump contains all text in the Wikimedia projects from the beginning (2001) until the date of the dump. You can restrict processing to a time interval of your choice with the following snippets:
In Java (via the command-line options of the built-in jobs):
hadoop jar hedera-XXX.jar [CLASS_NAME] -begin [TIME1] -end [TIME2]
In Pig:
SET 'org.hedera.input.begintime' TIME1
SET 'org.hedera.input.endtime' TIME2
where TIME1 and TIME2 are strings in ISO time format.
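If you build your own Java job and want to set the interval programmatically, a hedged equivalent of the Pig statements above is to set the same property names on the job configuration. The property names come from the Pig snippet; whether the Java code paths read them directly (rather than only the -begin/-end options) is an assumption worth verifying against the source, and the dates below are placeholders:

Configuration conf = job.getConfiguration();
conf.set("org.hedera.input.begintime", "2011-01-01T00:00:00Z"); // TIME1, ISO time format
conf.set("org.hedera.input.endtime", "2011-12-31T23:59:59Z");   // TIME2, ISO time format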
4. Information Extraction
(Documentation in progress)
To take full advantage of Hedera's functionality, the best way is to look at the example code provided in the project's source, as well as the API (Hedera is currently growing fast, but we try to keep the documentation up to date with the significant milestones along the way). Below we provide brief guidelines for working with Hedera in Java, Pig and Python, the three programming languages currently supported.
Java:
If you work in Java, using or writing an information extraction program is fairly easy. Depending on your needs, you choose the appropriate input format mentioned above and define the mapper and reducer functions according to the (key,value) pairs emitted by that input format. For example, if you want to work with the differences between two consecutive revisions of the same Wikipedia page, your Mapper will look like:
class MyMapper extends Mapper<LongWritable, RevisionDiff, <YOUR KEY TYPE>, <YOUR VALUE TYPE>> {
...
}
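For a slightly fuller picture, here is a minimal, compilable sketch of such a mapper. The package of RevisionDiff is left to the Javadoc API and none of its accessors are called here; the sketch merely counts diff records per page:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
// import ...RevisionDiff;  // exact package path: see the Javadoc API

public class DiffCountMapper
    extends Mapper<LongWritable, RevisionDiff, LongWritable, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable pageId, RevisionDiff diff, Context context)
      throws IOException, InterruptedException {
    // Replace this with logic over the RevisionDiff accessors (see Javadoc);
    // here we simply emit one count per consecutive-revision pair of the page.
    context.write(pageId, ONE);
  }
}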
We have provided a few example extraction programs as MapReduce jobs in the package org.hedera.mapreduce
(see API). As a first appetizer, here is a quick example illustrating how you can make use of Hedera's extraction features. The following command extracts, directly from the raw XML revision dumps, the graph of inter-links between Wikipedia articles, labeling each edge with the anchor text and the timestamp of the revision. The graph is output in edge-list format:
sh etc/run-jars.sh /path/to/hedera-XXX.jar org.hedera.mapreduce.ExtractTemporalAnchorText /path/to/your-revision-xml-dumps /path/to/output/results [NUMBER OF REDUCERS: 1, 2, 40, etc.]
Simply replace the MapReduce job org.hedera.mapreduce.ExtractTemporalAnchorText
with other jobs (built-in or your own). If you do not know the parameters of a job, just leave them empty; a help message will be printed to guide you in specifying the inputs.
Pig Latin:
To extract information from XML dumps via Pig Latin, you need to register Hedera in your Pig script:
REGISTER '$YOUR_LIBDIR/hedera-XXX.jar';
In order to work with the different input formats, you need to use the corresponding UDF Pig loader (read here if you have not worked with loaders before). The loaders can be found in the package org.hedera.pig.load
. Each loader relies on one input format and transforms the Java data structures into tuples of type org.apache.pig.data.Tuple
. You can find more examples of how to use Pig scripts to extract various kinds of information in the directory "pig" under the root directory. As a first taste of Hedera and Pig together, please check the "Flattening dumps" example below to get an idea of how Hedera can be used in Pig Latin.
Python:
If you want to write your MapReduce job in Python, for example via Hadoop Streaming, you need to transform the data into a string-like format, for example JSON. The following code assumes that you have transformed each revision into a JSON object using our built-in flattening script (see the "Flattening dumps" section below) and serialized it as one line in a text file. You can then write your mapper as follows:
#!/usr/bin/env python
from mrjob.job import MRJob
import json

class YourMRJob(MRJob):

    def mapper(self, key, line):
        obj = json.loads(line)
        title = obj['page_title']
        # extract other information
        yield __your_key__, __your_output_value__

if __name__ == "__main__":
    YourMRJob.run()
(In case you are wondering, the above snippet uses the mrjob framework. This is only for illustration; you can write your Python code any way you want.)
You can find more examples on how to use Python scripts to extract various information in the directory "python" under the root directory.
Flattening dumps to CSV files
One of the favourite formats for handling text across programming languages (e.g. Python) is plain-text CSV files. In Hedera you can flatten the XML dumps into CSV files simply by calling the Pig script XML2JSON.pig in the source directory pig/utils
. The output is a set of .csv files, each containing one revision per line in JSON format with the following schema:
{
"page_id":NUMBER,
"page_title":STRING,
"page_namespace":NUMBER,
"rev_id":NUMBER,
"parent_id":NUMBER,
"timestamp":NUMBER,
"user":STRING,
"user_id":NUMBER,
"comment":STRING,
"text":STRING
}
You can see how this works by calling:
pig -p BASEDIR="/directory/path/to/your/input-xml-dumps" -p OUTPUT="/path/to/output/results" XML2JSON.pig
Now, if you continue to work in Java, Hedera can pipeline these newly formatted files via the input format WikiFullRevisionJsonInputFormat
, which transforms each line into a pair of (LongWritable
, FullRevision
), where FullRevision
extends the Revision data structure with additional data on the user and the revision comment (see API here).
If you continue to work in Python, read these files line by line and parse the information using a JSON parser library (see the Python example above). It is as simple as that!