Implementing a new Analyzer

To create a new analyzer, extend the class AbstractAnalyzer. Following the Fetcher implementation example, let's implement an analyzer for JSON content:

     
import java.io.File;
import java.util.Iterator;

import org.dice_research.squirrel.analyzer.AbstractAnalyzer;
import org.dice_research.squirrel.collect.UriCollector;
import org.dice_research.squirrel.data.uri.CrawleableUri;
import org.dice_research.squirrel.sink.Sink;

public class JsonAnalyzer extends AbstractAnalyzer {

	public JsonAnalyzer(UriCollector collector) {
		super(collector);
	}

	@Override
	public Iterator analyze(CrawleableUri curi, File data, Sink sink) {
		// TODO Auto-generated method stub
		return null;
	}

	@Override
	public boolean isElegible(CrawleableUri curi, File data) {
		// TODO Auto-generated method stub
		return false;
	}

}
      

All implementations must define an explicit constructor that calls the superclass constructor and receives a UriCollector as its argument. A URI collector stores the URIs that a worker finds while crawling or processing a certain URI. After crawling, the URI collector is asked for these URIs through the getUris(CrawleableUri) method, and they are sent to the Frontier.
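
The collect-then-hand-over behaviour described above can be pictured with a minimal in-memory sketch. This is not Squirrel's UriCollector: the class name InMemoryUriCollectorSketch, the addUri method, and the use of plain strings instead of CrawleableUri are all assumptions made for illustration. Only the idea of a getUris lookup per crawled URI comes from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a URI collector (not the Squirrel class).
class InMemoryUriCollectorSketch {

    // URIs found so far, grouped by the URI that was being crawled.
    private final Map<String, List<String>> found = new HashMap<>();

    // Called whenever a new URI is discovered while processing crawledUri.
    void addUri(String crawledUri, String newUri) {
        found.computeIfAbsent(crawledUri, k -> new ArrayList<>()).add(newUri);
    }

    // Called after crawling; the result would be handed over to the Frontier.
    Iterator<String> getUris(String crawledUri) {
        return found.getOrDefault(crawledUri, Collections.<String>emptyList()).iterator();
    }
}
```

In this sketch a crawled URI that produced no new URIs yields an empty iterator rather than null, which keeps the caller free of null checks.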

The isElegible method receives the CrawleableUri and the fetched file. It checks whether the analyzer can in fact analyze the content and returns a boolean value.

If the analyzer is eligible, its analyze method is called. The method receives the CrawleableUri, the fetched file, and the sink where the extracted triples will be stored.
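
The check-then-analyze sequence can be sketched as follows. The types here are simplified stand-ins, not the Squirrel API: AnalyzerSketch, FixedTypeAnalyzerSketch and AnalyzerDispatchSketch are hypothetical names, and strings replace CrawleableUri, File and Sink. Stopping at the first eligible analyzer is also an assumption; the actual manager may behave differently.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified analyzer contract (the real one uses CrawleableUri, File and Sink).
interface AnalyzerSketch {
    boolean isElegible(String contentType);
    Iterator<String> analyze(String content);
}

// Toy analyzer that is eligible for exactly one content type.
class FixedTypeAnalyzerSketch implements AnalyzerSketch {
    private final String type;

    FixedTypeAnalyzerSketch(String type) {
        this.type = type;
    }

    @Override
    public boolean isElegible(String contentType) {
        // Null-safe: a missing content type simply means "not eligible".
        return type.equals(contentType);
    }

    @Override
    public Iterator<String> analyze(String content) {
        // A real analyzer would parse the content; this one only tags it.
        return Collections.singletonList(type + ":" + content).iterator();
    }
}

class AnalyzerDispatchSketch {

    // Assumption: the first analyzer that declares itself eligible handles the content.
    static Iterator<String> dispatch(List<AnalyzerSketch> analyzers, String contentType, String content) {
        for (AnalyzerSketch analyzer : analyzers) {
            if (analyzer.isElegible(contentType)) {
                return analyzer.analyze(content);
            }
        }
        // No analyzer could handle the content.
        return Collections.emptyIterator();
    }
}
```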

Let's take a look at an analyzer implementation that extracts triples from fetched JSON data:


package org.dice_research.squirrel.analyzer.example;    

import java.io.File;
import java.util.Iterator;

import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDF;
import org.dice_research.squirrel.analyzer.AbstractAnalyzer;
import org.dice_research.squirrel.analyzer.commons.FilterSinkRDF;
import org.dice_research.squirrel.collect.UriCollector;
import org.dice_research.squirrel.data.uri.CrawleableUri;
import org.dice_research.squirrel.sink.Sink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JsonAnalyzer extends AbstractAnalyzer {

	private static final Logger LOGGER = LoggerFactory.getLogger(JsonAnalyzer.class);

	public JsonAnalyzer(UriCollector collector) {
		super(collector);
	}

	@Override
	public Iterator analyze(CrawleableUri curi, File data, Sink sink) {
		try {
			// initialize the StreamRDF using our FilterSinkRDF implementation
			StreamRDF filtered = new FilterSinkRDF(curi, sink, collector);
			// try to parse the file as JSON-LD
			RDFDataMgr.parse(filtered, data.getAbsolutePath(), Lang.JSONLD);
			return collector.getUris(curi);
		} catch (Exception e) {
			LOGGER.error("Exception while analyzing. Aborting. ", e);
			return null;
		}
	}

	@Override
	public boolean isElegible(CrawleableUri curi, File data) {
		// the analyzer is eligible if the fetched URI is marked as JSON;
		// the constant-first comparison avoids a NullPointerException when no type is set
		return "json".equals(curi.getData("type"));
	}

}
      

The analyzer checks whether the CrawleableUri carries type information. If the type is "json", the analyzer is eligible.

The analyze method should perform the steps necessary to extract triples from the fetched content. At the end, it should call the collector's getUris(CrawleableUri) method. This returns the URIs that were collected, which are then sent to the Frontier.

Just like the fetcher, the analyzer must be referenced in worker-context.xml as well:


                     
	<bean id="jsonAnalyzerBean" class="org.dice_research.squirrel.analyzer.example.JsonAnalyzer">
		<constructor-arg index="0" ref="uriCollectorBean" />
	</bean>

	<bean id="analyzerBean" class="org.dice_research.squirrel.analyzer.manager.SimpleAnalyzerManager">
		<constructor-arg>
			<list>
				<ref bean="jsonAnalyzerBean" />
				<ref bean="rdfAnalyzerBean" />
			</list>
		</constructor-arg>
	</bean>

      

Notice that the bean needs to reference a UriCollector in its constructor. Check the Collector and Serializers documentation for details.