Configuring Filters

Filtering is an essential part of crawling: it defines which URIs should be crawled and which should not.

Squirrel lets you configure multiple filters in the frontier configuration through the UriFilterConfigurator class. This class must be instantiated as a bean in frontier-context.xml for the Frontier to work. See the example below:


	<bean id="UriFilterBean"
		class="org.dice_research.squirrel.data.uri.filter.UriFilterConfigurator">
		<constructor-arg index="0" ref="mongoDBKnowUriFilter" />
		<constructor-arg index="1">
			<list>
				  <!-- references to additional filters -->
			</list>
		</constructor-arg>
		<constructor-arg index="2" value="OR" />
	</bean>
	
	
	<bean id="mongoDBKnowUriFilter"
		class="org.dice_research.squirrel.data.uri.filter.MongoDBKnowUriFilter">
		<constructor-arg index="0"
			value="#{systemEnvironment['MDB_HOST_NAME']}" />
		<constructor-arg index="1"
			value="#{systemEnvironment['MDB_PORT']}" />
	</bean>

The constructor receives three arguments:

  • A filter that implements the KnownUriFilter interface. It ensures that a URI that has already been crawled is not crawled again (unless periodic recrawling is enabled). The example above uses the default implementation, MongoDBKnowUriFilter, which stores the already crawled URIs in MongoDB.
  • A list of additional filter implementations that are combined with the KnownUriFilter. Only references to beans that implement the UriFilter interface are allowed. The list may be empty.
  • An operator that defines how the filter results are combined. Two options are available: "AND" and "OR". With "AND", all filters must return true to stop the crawling. With "OR", the crawling stops if at least one filter returns true.
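
The difference between the two operators can be illustrated with a small, self-contained sketch. It is not Squirrel's implementation: the class FilterOperatorSketch and its combine method are hypothetical, and each filter is modelled as a plain java.util.function.Predicate over the URI string rather than as a UriFilter. The sketch only shows how "AND" and "OR" combine the boolean results of several filters.

```java
import java.util.List;
import java.util.function.Predicate;

/** Illustration only: combines boolean filter results with "AND" or "OR". */
final class FilterOperatorSketch {

    /**
     * Hypothetical helper, not part of Squirrel. Each filter is modelled as a
     * Predicate over the URI string instead of Squirrel's UriFilter type.
     */
    static boolean combine(String operator, List<Predicate<String>> filters, String uri) {
        switch (operator) {
            case "AND":
                // true only if every filter returns true
                return filters.stream().allMatch(f -> f.test(uri));
            case "OR":
                // true if at least one filter returns true
                return filters.stream().anyMatch(f -> f.test(uri));
            default:
                throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = List.of(
                uri -> uri.startsWith("http://example.org/"), // stand-in for the first filter
                uri -> uri.endsWith(".pdf"));                  // stand-in for an additional filter
        String uri = "http://example.org/data.ttl";
        System.out.println(combine("AND", filters, uri)); // false: the second filter returns false
        System.out.println(combine("OR", filters, uri));  // true: the first filter returns true
    }
}
```

With the same two filters and the same URI, "AND" yields false while "OR" yields true, which is why the choice of operator matters as soon as the list of additional filters is not empty.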

Implementing a new Filter

Filter implementations should implement the UriFilter interface.
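
As a rough illustration, a custom filter might look like the sketch below. Everything in it is an assumption made for the example: AllowedHostFilter is a hypothetical filter, and the nested CrawleableUri and UriFilter types are simplified stand-ins so that the snippet compiles on its own. In a real project the filter would implement Squirrel's own UriFilter and receive Squirrel's CrawleableUri.

```java
import java.net.URI;
import java.util.Set;

/** Illustration only: a custom filter built against simplified stand-in types. */
final class DomainFilterSketch {

    /** Stand-in for Squirrel's CrawleableUri; only the URI itself is modelled. */
    static final class CrawleableUri {
        private final URI uri;

        CrawleableUri(URI uri) {
            this.uri = uri;
        }

        URI getUri() {
            return uri;
        }
    }

    /** Stand-in that mirrors the two methods of the UriFilter interface. */
    interface UriFilter {
        boolean isUriGood(CrawleableUri uri);

        void add(CrawleableUri uri);
    }

    /** Hypothetical filter that accepts only URIs whose host is in an allow list. */
    static final class AllowedHostFilter implements UriFilter {
        private final Set<String> allowedHosts;

        AllowedHostFilter(Set<String> allowedHosts) {
            this.allowedHosts = allowedHosts;
        }

        @Override
        public boolean isUriGood(CrawleableUri uri) {
            // URIs without a host (for example "mailto:" URIs) are rejected
            String host = uri.getUri().getHost();
            return host != null && allowedHosts.contains(host);
        }

        @Override
        public void add(CrawleableUri uri) {
            // This filter keeps no state, so there is nothing to record.
        }
    }
}
```

A filter like this would then be declared as a bean and referenced in the list passed to UriFilterConfigurator. The stand-in types mirror the interface declaration reproduced below.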


/**
 * A simple filter that can decide whether a given {@link CrawleableUri} object
 * imposes a certain requirement or not.
 * 
 * @author Michael Röder (roeder@informatik.uni-leipzig.de)
 *
 */
public interface UriFilter {

    /**
     * Returns true if the given {@link CrawleableUri} object fulfills the
     * requirements imposed by this filter.
     * 
     * @param uri
     *            the {@link CrawleableUri} object that is checked
     * @return true if the given {@link CrawleableUri} object fulfills the
     *         requirements imposed by this filter. Otherwise false is returned.
     */
    public boolean isUriGood(CrawleableUri uri);
    
    
    /**
     * Adds the given URI to the list of already known URIs. Works like calling {@link #add(CrawleableUri, long)} with the current system time.
     *
     * @param uri the URI that should be added to the list.
     * 
     */    
    public void add(CrawleableUri uri);