Analyzers

The Analyzer is responsible for extracting triples from fetched data. It is possible to run multiple Analyzer beans at the same time; they have to be referenced by the generic implementation, as shown below:


	   
	   
	<bean id="analyzerBean"
		class="org.dice_research.squirrel.analyzer.manager.SimpleAnalyzerManager">
		<constructor-arg index="0">
			<list>
				<ref bean="rdfAnalyzerBean"/>
				<ref bean="ckananalyzerBean"/>
				<ref bean="rdfaAnalyzerBean"/>
				<ref bean="microdataanalyzerBean"/>
				<ref bean="microformatanalyzerBean"/>
				<ref bean="hdtanalyzerBean"/>
				<ref bean="htmlscraperanalyzerBean"/>
			</list>
		</constructor-arg>
	</bean>

      

All analyzers must extend the abstract class org.dice_research.squirrel.analyzer.impl.AbstractAnalyzer and implement the analyze and isElegible methods. The constructor of an analyzer receives an instance of org.dice_research.squirrel.collect.UriCollector. The isElegible method checks whether the analyzer implementation is capable of dealing with the fetched data; if it is, the analyze method is called. The analyze method receives the URI that is being crawled, the fetched file and the chosen sink implementation.
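The contract described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Squirrel's actual code: the types TripleSink, Analyzer, LineBasedNtAnalyzer and AnalyzerDispatch, as well as all method signatures, are hypothetical stand-ins, and the real AbstractAnalyzer, UriCollector and sink classes may look different. Only the method names isElegible and analyze are taken from the text. The UriCollector constructor argument is left out for brevity, and the sketch assumes that every eligible analyzer is run on the fetched file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the sink that receives the extracted triples.
interface TripleSink {
    void addTriple(String uri, String triple);
}

// Hypothetical stand-in for the analyzer contract (not the real Squirrel API).
interface Analyzer {
    // Returns true if this analyzer can handle the fetched file.
    boolean isElegible(String uri, Path data);

    // Extracts triples from the fetched file and writes them to the sink.
    void analyze(String uri, Path data, TripleSink sink) throws IOException;
}

// Toy analyzer: treats every non-empty line of an ".nt" file as one triple.
class LineBasedNtAnalyzer implements Analyzer {
    @Override
    public boolean isElegible(String uri, Path data) {
        return data.getFileName().toString().endsWith(".nt");
    }

    @Override
    public void analyze(String uri, Path data, TripleSink sink) throws IOException {
        for (String line : Files.readAllLines(data, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                sink.addTriple(uri, trimmed);
            }
        }
    }
}

// Toy dispatcher: runs every analyzer that declares itself eligible
// (an assumption of this sketch) and returns how many analyzers were used.
class AnalyzerDispatch {
    static int dispatch(List<Analyzer> analyzers, String uri, Path data, TripleSink sink)
            throws IOException {
        int used = 0;
        for (Analyzer analyzer : analyzers) {
            if (analyzer.isElegible(uri, data)) {
                analyzer.analyze(uri, data, sink);
                used++;
            }
        }
        return used;
    }
}

class AnalyzerSketch {
    public static void main(String[] args) throws IOException {
        // Simulate a fetched file containing a single N-Triples statement.
        Path file = Files.createTempFile("fetched", ".nt");
        Files.write(file, List.of("<http://example.org/s> <http://example.org/p> \"o\" ."));

        // The sink here simply collects the triples in memory.
        List<String> collected = new ArrayList<>();
        TripleSink sink = (uri, triple) -> collected.add(triple);

        int used = AnalyzerDispatch.dispatch(
                List.of(new LineBasedNtAnalyzer()), "http://example.org/data.nt", file, sink);
        System.out.println(used + " analyzer(s) ran, " + collected.size() + " triple(s) collected");
        Files.delete(file);
    }
}
```

Keeping the eligibility check separate from the analysis lets several analyzers be registered side by side, with each one deciding for itself whether a fetched file is relevant to it.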

Currently, the following analyzers are available:

  • Analyzer for RDF files. The following serializations are supported by this analyzer:
    • RDF/XML
    • N-Triples, N3 and N-Quads (NQ)
    • Turtle (TTL)
    • TriG and TriX
    • JSON-LD
  • RDFa Analyzer for HTML and XHTML Documents.
  • HTML Scraper. An Analyzer for scraping HTML pages. It uses the Jsoup framework for scraping and is configured through YAML files that define how a certain domain and its contexts or pages should be scraped. For instructions on how to use and configure it, see the HTML Scraper documentation.
  • CKAN Analyzer. It is used for the JSON Lines files that are loaded from the CKAN API and transforms the information about datasets in the CKAN portal into RDF triples using the DCAT ontology.
  • Any23-based analyzers that handle Microdata and Microformats.