Collectors and Serializers

A URI collector stores the serialized URIs that an analyzer finds while crawling or processing a given URI. After crawling, the collector is asked for these URIs through the getUris(CrawleableUri) method so that they can be sent to the frontier. The collector requires a Serializer. Squirrel offers three serializers with self-explanatory names:

  • org.dice_research.squirrel.data.uri.serialize.java.SnappyJavaUriSerializer
  • org.dice_research.squirrel.data.uri.serialize.java.GzipJavaUriSerializer
  • org.dice_research.squirrel.data.uri.serialize.gson.GsonUriSerializer
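
The collector beans in this section refer to a serializer bean through the id serializerBean. A minimal sketch of such a bean follows. It assumes that the chosen serializer can be created through a no-argument constructor (an assumption; check the class before relying on it) and that the id matches the ref used by the collector:

    <!-- Sketch only: assumes GzipJavaUriSerializer has a no-argument constructor -->
    <bean id="serializerBean"
          class="org.dice_research.squirrel.data.uri.serialize.java.GzipJavaUriSerializer" />

Any of the three classes listed above could be substituted in the class attribute.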

There are two Collector implementations, both of which receive a Serializer bean as a constructor argument:

  • SimpleUriCollector: Stores the found URIs in memory. This collector should be used only for testing or for small graphs. Create the bean with:
    
        <bean id="uriCollectorBean"
              class="org.dice_research.squirrel.collect.SimpleUriCollector">
            <constructor-arg index="0" ref="serializerBean" />
        </bean>
  • SqlBasedUriCollector: Stores the found URIs in a local HSQLDB database. This collector is only marginally slower than the SimpleUriCollector, but it can store many more URIs because they are serialized to disk. Create the bean with:
    
    
        <bean id="uriCollectorBean"
              class="org.dice_research.squirrel.collect.SqlBasedUriCollector">
            <constructor-arg index="0" ref="serializerBean" />
            <constructor-arg index="1" value="foundUris" />
        </bean>
    		 
    The second constructor argument is the filename of the HSQLDB database.