To create a new Fetcher, implement the Fetcher interface:
package org.dice_research.squirrel.fetcher;

public interface Fetcher extends Closeable {

    public default File fetch(CrawleableUri uri) throws RuntimeException {
        return fetch(uri, DummyDelayer.get());
    }

    public File fetch(CrawleableUri uri, Delayer delayer);
}
An implementation must provide the fetch(CrawleableUri, Delayer) method and the close method, which is inherited from the Closeable interface.
The fetch method receives a CrawleableUri object, which represents a URI together with additional metadata that is helpful for crawling it. The URI itself can be accessed through the getUri() method. The method must return a File object pointing to the file that was fetched. After the fetcher has been used, the close method is called so that the fetcher can close any open stream or finalize any other process.
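The contract described above can be illustrated with a small, self-contained sketch. The types SketchFetcher, SketchDelayer, and RecordingFetcher are hypothetical stand-ins that only mirror the shape of the interface shown earlier; they are not Squirrel classes, and java.net.URI is used in place of CrawleableUri so that the example compiles without Squirrel on the classpath. The sketch shows that the one-argument fetch delegates to the two-argument version, and that try-with-resources calls close once the fetcher is no longer needed.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.net.URI;

// Hypothetical stand-in for Squirrel's Delayer (assumption: NOT the real type).
interface SketchDelayer {
    void getRequestPermission() throws InterruptedException;
}

// Hypothetical stand-in that mirrors the shape of the Fetcher interface.
interface SketchFetcher extends Closeable {
    // One-argument overload: delegates with a delayer that never waits.
    default File fetch(URI uri) {
        return fetch(uri, () -> { });
    }

    File fetch(URI uri, SketchDelayer delayer);
}

// Minimal implementation: only the two-argument fetch and close are written.
class RecordingFetcher implements SketchFetcher {
    boolean closed = false;

    @Override
    public File fetch(URI uri, SketchDelayer delayer) {
        try {
            delayer.getRequestPermission();
            // A real fetcher would write the downloaded content into this file.
            File file = File.createTempFile("fetched_", "");
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            return null; // null signals that nothing could be fetched
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    @Override
    public void close() {
        closed = true; // a real fetcher would release streams or connections here
    }
}

class FetcherContractDemo {
    public static void main(String[] args) {
        RecordingFetcher fetcher = new RecordingFetcher();
        // try-with-resources guarantees that close() runs after the fetch
        try (RecordingFetcher f = fetcher) {
            File file = f.fetch(URI.create("http://example.org/data.json"));
            System.out.println("fetched: " + (file != null));
        }
        System.out.println("closed: " + fetcher.closed);
    }
}
```

Because the default method supplies the one-argument overload, an implementation only has to write the two-argument fetch and close. Callers should still check the returned File for null before using it.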
As an example, consider a fetcher that retrieves JSON data:
package org.dice_research.squirrel.fetcher.example;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import org.apache.commons.io.FileUtils;
import org.dice_research.squirrel.data.uri.CrawleableUri;
import org.dice_research.squirrel.fetcher.Fetcher;
import org.dice_research.squirrel.fetcher.delay.Delayer;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class JsonFetcher implements Fetcher {

    private static final Logger LOGGER = LoggerFactory.getLogger(JsonFetcher.class);

    // Stream of the most recent request; released in close()
    private InputStream is = null;

    @Override
    public void close() throws IOException {
        // fetch may never have been called, or may have failed before opening the stream
        if (is != null) {
            is.close();
        }
    }

    @Override
    public File fetch(CrawleableUri uri, Delayer delayer) {
        try {
            // Wait until the delayer permits the request
            delayer.getRequestPermission();
            is = uri.getUri().toURL().openStream();
            BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
            String jsonText = readAll(rd);
            // Throws an exception if the response is not a valid JSON object
            JSONObject json = new JSONObject(jsonText);
            File file = File.createTempFile("fetched_", "", FileUtils.getTempDirectory());
            // try-with-resources closes the writer even if the write fails
            try (FileWriter fw = new FileWriter(file)) {
                fw.write(json.toString());
            }
            // Metadata that the Analyzer step can use
            uri.addData("type", "json");
            return file;
        } catch (Exception e) {
            LOGGER.error("Could not fetch Json from URI: " + uri.getUri().toString(), e);
        }
        // Reached only if fetching or parsing failed
        return null;
    }

    // Reads the complete content of the given reader into a String
    private String readAll(Reader rd) throws IOException {
        StringBuilder sb = new StringBuilder();
        int cp;
        while ((cp = rd.read()) != -1) {
            sb.append((char) cp);
        }
        return sb.toString();
    }
}
The fetcher starts by calling getRequestPermission() on the delayer, which delays the request so that the robots.txt directives are respected. It then opens the URI as a URL and parses the response as a JSONObject. If parsing fails, an exception is thrown, the error is logged, and null is returned. If parsing succeeds, a temporary file is created and the String representation of the JSONObject is written to it. Finally, metadata is added to the URI so that it can be used in the Analyzer step.
To use the Fetcher at runtime, declare a bean for the implementation and add a reference to that bean to the constructor list of the FetcherManager in worker-context.xml:
<bean id="jsonFetcherBean"
    class="org.dice_research.squirrel.fetcher.example.JsonFetcher" />
<bean id="fetcherBean"
    class="org.dice_research.squirrel.fetcher.manage.SimpleOrderedFetcherManager">
    <constructor-arg>
        <list>
            <ref bean="jsonFetcherBean" />
            <ref bean="httpFetcherBean" />
        </list>
    </constructor-arg>
</bean>
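The order of the references in the list is likely to matter. The following sketch assumes, as the class name SimpleOrderedFetcherManager suggests, that the manager tries the fetchers in list order and uses the first result that is not null. It is a hypothetical illustration using plain java.util.function.Function objects, not Squirrel's actual manager code, and the OrderedFallback class and the lambda "fetchers" are invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of ordered fallback (assumption: NOT Squirrel's implementation).
class OrderedFallback {

    // Tries each fetch function in list order and returns the first non-null result.
    static <U, R> R fetchFirst(List<Function<U, R>> fetchers, U uri) {
        for (Function<U, R> fetcher : fetchers) {
            R result = fetcher.apply(uri);
            if (result != null) {
                return result;
            }
        }
        return null; // no fetcher could handle the URI
    }

    public static void main(String[] args) {
        // Stand-in for a JSON fetcher: gives up (returns null) on anything that is not JSON
        Function<String, String> jsonFetcher = uri -> uri.endsWith(".json") ? "json-file" : null;
        // Stand-in for a general-purpose fetcher that accepts every URI
        Function<String, String> httpFetcher = uri -> "http-file";

        List<Function<String, String>> fetchers = Arrays.asList(jsonFetcher, httpFetcher);
        System.out.println(fetchFirst(fetchers, "http://example.org/data.json"));
        System.out.println(fetchFirst(fetchers, "http://example.org/page.html"));
    }
}
```

Under this assumption, placing jsonFetcherBean before httpFetcherBean gives the JSON fetcher the first attempt at each URI, and returning null from fetch, as JsonFetcher does when parsing fails, lets the next fetcher in the list take over.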