Worker Components

Squirrel uses Spring for dependency injection. The worker component reads the environment variable CONTEXT_CONFIG_FILE to locate the Spring beans file. In this file, you define all the implementations that the worker component will use at runtime.
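
As a rough orientation, a Spring beans file has the general shape sketched below. The root element and namespace declarations are standard Spring XML; the two beans inside (their ids and the com.example classes) are hypothetical placeholders, not Squirrel classes, and only show how one bean is passed to another by reference.

    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
                               http://www.springframework.org/schema/beans/spring-beans.xsd">

        <!-- Hypothetical bean: id and class are placeholders for illustration -->
        <bean id="exampleDependency" class="com.example.ExampleDependency" />

        <!-- Hypothetical bean that receives the bean above as its first constructor argument -->
        <bean id="exampleConsumer" class="com.example.ExampleConsumer">
            <constructor-arg index="0" ref="exampleDependency" />
        </bean>

    </beans>

The ref attribute must match the id of another bean in the same context, while value passes a literal.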

The spring_config folder contains example bean configurations for different combinations (file system storage, SPARQL storage). If you are not familiar with the Spring Framework, read its documentation first.

The workerBean is defined as follows:


    <bean id="workerBean"
          class="org.dice_research.squirrel.worker.impl.WorkerImpl">
        <constructor-arg index="0" ref="workerComponent" />
        <constructor-arg index="1" ref="fetcherBean" />
        <constructor-arg index="2" ref="sinkBean" />
        <constructor-arg index="3" ref="analyzerBean" />
        <constructor-arg index="4" ref="robotsManagerBean" />
        <constructor-arg index="5" ref="serializerBean" />
        <constructor-arg index="6" ref="uriCollectorBean" />
        <!-- waitingTime, in milliseconds -->
        <constructor-arg index="7" value="2000" />
        <!-- domainLogFile: log location under OUTPUT_FOLDER -->
        <constructor-arg index="8"
                         value="#{systemEnvironment['OUTPUT_FOLDER']}/log" />
        <constructor-arg index="9" value="false" />
    </bean>
      

Always use the class org.dice_research.squirrel.worker.impl.WorkerImpl as the implementation of the workerBean. Its constructor takes the 10 arguments shown above (indexes 0 to 9), including:

  • workerComponent: responsible for initializing the worker and managing the other components. This bean cannot be changed because it is autowired. Always use the workerComponent reference.
  • fetcherBean: the bean that manages the fetchers used by the worker instance.
  • sinkBean: the bean that manages the sink.
  • analyzerBean: the bean that manages the analyzers used by the worker. You can define which analyzers this bean will use.
  • robotsManagerBean: the bean responsible for applying the rules of the robots exclusion standard. There is only one possible implementation for this bean.
  • serializerBean: the bean that provides the serializer used by the worker.
  • uriCollectorBean: the bean responsible for collecting the URIs found by the analyzer, which are then sent to the frontier.
  • waitingTime: the time (in ms) the worker waits before requesting new URIs again when the frontier could not provide any.
  • domainLogFile: the location where the log will be stored.
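
The log location in the bean definition above is built from the OUTPUT_FOLDER environment variable with a Spring Expression Language (SpEL) expression. As a minimal sketch, assuming you want a fallback when that variable is not set, the SpEL Elvis operator (?:) can supply a default. The path /tmp/squirrel is only an illustrative assumption, not a Squirrel default.

    <!-- Same argument as index 8 above, with a hypothetical fallback path -->
    <constructor-arg index="8"
                     value="#{systemEnvironment['OUTPUT_FOLDER'] ?: '/tmp/squirrel'}/log" />

With OUTPUT_FOLDER unset, the expression part evaluates to the fallback and the literal /log suffix is appended to it.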

For beans that have multiple implementations, see the reference below.