Filtering is an essential task in crawling: it determines which URIs should be crawled and which should not.
Squirrel allows the user to configure multiple filters in the frontier configuration through the UriFilterConfigurator class. This class must be instantiated in frontier-context.xml for the Frontier to work. See the example below:
<bean id="UriFilterBean"
    class="org.dice_research.squirrel.data.uri.filter.UriFilterConfigurator">
    <constructor-arg index="0" ref="mongoDBKnowUriFilter" />
    <constructor-arg index="1">
        <list>
            <!-- additional filter references -->
        </list>
    </constructor-arg>
    <constructor-arg index="2" value="OR" />
</bean>
<bean id="mongoDBKnowUriFilter"
    class="org.dice_research.squirrel.data.uri.filter.MongoDBKnowUriFilter">
    <constructor-arg index="0"
        value="#{systemEnvironment['MDB_HOST_NAME']}" />
    <constructor-arg index="1"
        value="#{systemEnvironment['MDB_PORT']}" />
</bean>
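The UriFilterConfigurator constructor receives three arguments: the known-URI filter bean (index 0, here mongoDBKnowUriFilter), a list of references to additional filters (index 1), and the operator value (index 2, here OR).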
Filter implementations should implement the UriFilter interface.
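As a rough illustration of what such an implementation could look like, the sketch below defines a filter that accepts only URIs with an allowed scheme. It is not part of Squirrel: SchemeUriFilter and SchemeUriFilterDemo are hypothetical names, and CrawleableUri and UriFilter are reduced stand-ins declared locally so the example compiles on its own (the stand-in CrawleableUri is assumed to simply wrap a java.net.URI). In a real project the Squirrel types would be imported instead.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Stand-in for Squirrel's CrawleableUri, reduced to the wrapped URI (assumption).
final class CrawleableUri {
    private final URI uri;

    CrawleableUri(URI uri) {
        this.uri = uri;
    }

    URI getUri() {
        return uri;
    }
}

// Stand-in mirroring the two methods shown in the UriFilter listing of this section.
interface UriFilter {
    boolean isUriGood(CrawleableUri uri);

    void add(CrawleableUri uri);
}

// Hypothetical filter: a URI is "good" only if its scheme is in the allow-list.
final class SchemeUriFilter implements UriFilter {
    private final Set<String> allowedSchemes;

    SchemeUriFilter(Set<String> allowedSchemes) {
        this.allowedSchemes = Set.copyOf(allowedSchemes);
    }

    @Override
    public boolean isUriGood(CrawleableUri uri) {
        // Relative URIs have no scheme and are rejected.
        String scheme = uri.getUri().getScheme();
        return scheme != null
                && allowedSchemes.contains(scheme.toLowerCase(Locale.ROOT));
    }

    @Override
    public void add(CrawleableUri uri) {
        // This filter keeps no state, so there is nothing to record.
    }
}

class SchemeUriFilterDemo {
    public static void main(String[] args) {
        UriFilter filter = new SchemeUriFilter(Set.of("http", "https"));
        CrawleableUri web = new CrawleableUri(URI.create("https://example.org/data"));
        CrawleableUri ftp = new CrawleableUri(URI.create("ftp://example.org/file"));
        System.out.println(filter.isUriGood(web)); // prints "true"
        System.out.println(filter.isUriGood(ftp)); // prints "false"
    }
}
```

A filter written this way would presumably be declared as its own bean and referenced from the list passed as the second constructor argument of UriFilterConfigurator. The interface as defined in Squirrel is listed next.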
/**
 * A simple filter that can decide whether a given {@link CrawleableUri} object
 * imposes a certain requirement or not.
 *
 * @author Michael Röder (roeder@informatik.uni-leipzig.de)
 *
 */
public interface UriFilter {
    /**
     * Returns true if the given {@link CrawleableUri} object fulfills the
     * requirements imposed by this filter.
     *
     * @param uri
     *            the {@link CrawleableUri} object that is checked
     * @return true if the given {@link CrawleableUri} object fulfills the
     *         requirements imposed by this filter. Otherwise false is returned.
     */
    public boolean isUriGood(CrawleableUri uri);

    /**
     * Adds the given URI to the list of already known URIs. Works like calling {@link #add(CrawleableUri, long)} with the current system time.
     *
     * @param uri the URI that should be added to the list.
     *
     */
    public void add(CrawleableUri uri);