A URI collector stores the serialized URIs that an analyzer finds while crawling or processing a given URI. After crawling, these URIs are requested from the collector through the getUris(CrawleableUri) method so that they can be sent to the frontier. The collector requires a Serializer. Squirrel offers three self-explanatory serializers.
There are two possible Collector implementations, both of which receive a Serializer bean as a constructor argument:
<bean id="uriCollectorBean"
      class="org.dice_research.squirrel.collect.SimpleUriCollector">
    <constructor-arg index="0" ref="serializerBean" />
</bean>
<bean id="uriCollectorBean"
      class="org.dice_research.squirrel.collect.SqlBasedUriCollector">
    <constructor-arg index="0" ref="serializerBean" />
    <constructor-arg index="1" value="foundUris" />
</bean>
The second constructor argument of SqlBasedUriCollector (index 1, here "foundUris") is the file name of the HSQLDB database.
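
Both collector beans point to a bean named serializerBean through ref="serializerBean", so a bean with that id has to be defined in the same Spring context. The fragment below is a minimal sketch of such a definition. The fully qualified class name is an assumption about the Squirrel code base, not something stated on this page, and should be checked against the Squirrel version in use.

```xml
<!-- Sketch only: the class name below is an assumption and must be
     verified against the Squirrel sources before use. -->
<bean id="serializerBean"
      class="org.dice_research.squirrel.data.uri.serialize.java.GzipJavaUriSerializer" />

<!-- The id must match the ref used by the collector bean. -->
<bean id="uriCollectorBean"
      class="org.dice_research.squirrel.collect.SqlBasedUriCollector">
    <constructor-arg index="0" ref="serializerBean" />
    <constructor-arg index="1" value="foundUris" />
</bean>
```

Only one bean with the id uriCollectorBean should be present in a context, so a configuration would contain either the SimpleUriCollector definition or the SqlBasedUriCollector definition, not both.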