Getting Started With Squirrel

In order to set up Squirrel, download the latest release from the GitHub project page:

https://github.com/dice-group/Squirrel/releases

After downloading it, decompress it into a folder of your choice. Squirrel runs inside Docker containers, so you will need to install Docker on your OS if you have not already. You can check the Docker CE installation instructions here. Make sure that you have Docker-Compose installed as well; check here for instructions.

After the installation of the Docker Engine and Docker-Compose, go to the extracted folder and run:

docker-compose up -d mongodb rabbit

This will start the RabbitMQ and MongoDB containers, which are used for the communication between the Frontier and the Workers and for the Frontier's queue, respectively.

Set your seeds in the seed/seeds.txt file. Now you can run the Frontier and one Worker instance with:
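The seeds file simply lists the URIs where crawling should start, one per line. A minimal sketch (the URIs below are placeholders; use the resources you actually want to crawl):

```
https://dbpedia.org/resource/Berlin
https://www.w3.org/People/Berners-Lee/card
```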

docker-compose up frontier worker1

The Docker-Compose File

The docker-compose file is where you configure the images that run Squirrel. There, you can notice some environment variables for the Frontier and the Workers. For the Frontier, most of them are self-explanatory, like MDB_HOST_NAME, MDB_PORT and HOBBIT_RABBIT_HOST, which point to the running MongoDB and RabbitMQ instances. The SEED_FILE variable points to the file containing the initial URIs that will be crawled, and it is obligatory. The URI_WHITELIST_FILE, however, should contain only the domains that are allowed to be crawled: if a domain is not on the list, its URIs will be ignored. This variable is optional, and the file may be empty.
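Putting these variables together, a Frontier service definition might look roughly like the sketch below. The image name, paths and port are illustrative assumptions; the docker-compose.yml shipped with the release is the authoritative version:

```yaml
# Sketch of a Frontier service entry (names and paths are illustrative)
frontier:
  image: squirrel
  environment:
    - HOBBIT_RABBIT_HOST=rabbit        # running RabbitMQ instance
    - MDB_HOST_NAME=mongodb            # running MongoDB instance
    - MDB_PORT=27017
    - SEED_FILE=/var/squirrel/seeds.txt          # obligatory
    - URI_WHITELIST_FILE=/var/squirrel/whitelist.txt  # optional
  volumes:
    - ./seed:/var/squirrel
```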

From the Worker side, you have the same variables pointing to the MongoDB and RabbitMQ instances. In addition, there is the CONTEXT_CONFIG_FILE variable, which points to the Spring bean metadata (see the section below). This variable is obligatory for the Worker to initialize its components. OUTPUT_FOLDER and HTML_SCRAPER_YAML_PATH are specific to the FileBasedSink and the HtmlScraperAnalyzer; you can read more about them in the Worker Components section. You can duplicate as many Worker services as you want and customize them with different implementations.
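A Worker entry could then be sketched as follows. Again, the image name and paths are assumptions for illustration; duplicating this block under a new service name (worker2, worker3, ...) with a different CONTEXT_CONFIG_FILE is how you run Workers with different implementations:

```yaml
# Sketch of a Worker service entry (names and paths are illustrative)
worker1:
  image: squirrel
  environment:
    - HOBBIT_RABBIT_HOST=rabbit
    - MDB_HOST_NAME=mongodb
    - MDB_PORT=27017
    - CONTEXT_CONFIG_FILE=/var/squirrel/spring-config/worker-context.xml  # obligatory
    - OUTPUT_FOLDER=/var/squirrel/data          # used by the FileBasedSink
    - HTML_SCRAPER_YAML_PATH=/var/squirrel/yaml # used by the HtmlScraperAnalyzer
  volumes:
    - ./data:/var/squirrel/data
```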

Squirrel releases come with two pre-configured docker-compose files: docker-compose.yml, configured for file storage, and docker-compose-sparql.yml for SPARQL endpoints, which includes a pre-configured Virtuoso container.


Working with Different Implementations

Squirrel uses Spring for dependency injection, which allows the user to control the implementations that will be used at run time. Under spring-config/* you can find some pre-configured context files. There, you can configure which implementations you want to use for the fetcher, analyzer and sink of a Worker instance. By default, Squirrel comes with worker-context.xml configured with the FileBasedSink and worker-context-sparql.xml configured with the SparqlBasedSink, for use with docker-compose-sparql.yml. To learn more about the default implementations available and how to write your own, please visit the Worker Components section. If you need a reference on Spring metadata configuration, please check this link.
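To give a feel for what such a context file looks like, here is a rough sketch of a sink bean definition in Spring's XML format. The bean id, class name and constructor argument are illustrative assumptions; consult the actual files under spring-config/* for the real definitions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a worker context file; class names are illustrative -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- Swap this bean for a SparqlBasedSink definition to write to a
         SPARQL endpoint instead of the file system -->
    <bean id="sinkBean" class="org.dice_research.squirrel.sink.FileBasedSink">
        <constructor-arg value="/var/squirrel/data" />
    </bean>

</beans>
```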