# Tutorial
In this tutorial we will go through one benchmark using two systems, two datasets, and one Stresstest.

We are using the following:
- Iguana v3.0.2
- Apache Jena Fuseki 3
- Blazegraph
## Download
First, let's create a working directory:

```sh
mkdir myBenchmark
cd myBenchmark
```
Now let's download all required systems and Iguana, starting with Iguana:

```sh
wget https://github.com/dice-group/IGUANA/releases/download/v3.0.2/iguana-3.0.2.zip
unzip iguana-3.0.2.zip
```
Next we will download Blazegraph:

```sh
mkdir blazegraph && cd blazegraph
wget https://downloads.sourceforge.net/project/bigdata/bigdata/2.1.5/blazegraph.jar?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Fbigdata%2Ffiles%2Fbigdata%2F2.1.5%2Fblazegraph.jar%2Fdownload%3Fuse_mirror%3Dmaster%26r%3Dhttps%253A%252F%252Fwww.blazegraph.com%252Fdownload%252F%26use_mirror%3Dnetix&ts=1602007009
cd ..
```
As the last of the systems, we need to download Apache Jena Fuseki and Apache Jena (the latter provides the `tdbloader2` tool we will use for loading):

```sh
mkdir fuseki && cd fuseki
wget https://downloads.apache.org/jena/binaries/apache-jena-3.16.0.zip
unzip apache-jena-3.16.0.zip
wget https://downloads.apache.org/jena/binaries/apache-jena-fuseki-3.16.0.zip
unzip apache-jena-fuseki-3.16.0.zip
cd ..
```
Finally we have to download our datasets. We use two small datasets from ScholarlyData: the ISWC 2010 and the EKAW 2012 complete-alignments datasets.

```sh
mkdir datasets/
cd datasets
wget http://www.scholarlydata.org/dumps/conferences/alignments/iswc-2010-complete-alignments.rdf
wget http://www.scholarlydata.org/dumps/conferences/alignments/ekaw-2012-complete-alignments.rdf
cd ..
```
That's it. Now let's set up Blazegraph and Fuseki.
## Setting Up Systems
To simplify the benchmark workflow we will use the pre- and post-script hooks: the pre-script hook loads and starts the current system, and the post-script hook stops it after the benchmark task.
### Blazegraph
First let's create the script files:

```sh
cd blazegraph
touch load-and-start.sh
touch stop.sh
```
The `load-and-start.sh` script will start Blazegraph and use curl to POST our dataset. In our case the datasets are pretty small, hence the loading time is minimal. For larger datasets it would be wise to load the dataset beforehand, back up the resulting `blazegraph.jnl` file, and simply swap that file in inside the pre-script hook.

For now, put this into `load-and-start.sh`:

```sh
# start blazegraph with 4 GB of RAM
cd ../blazegraph && java -Xmx4g -server -jar blazegraph.jar &

# load the dataset file, whose path is passed as the first script argument
curl -X POST -H 'Content-Type: application/rdf+xml' --data-binary "@$1" http://localhost:9999/blazegraph/sparql
```
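As a sketch of the journal-swap alternative mentioned above, the pre-script could restore a pre-built journal instead of POSTing the data on every run. Note that the `journals/` directory and its naming scheme are hypothetical, not part of this tutorial; each dataset would have to be loaded once beforehand and its journal saved there.

```sh
#!/bin/sh
# Hypothetical alternative pre-script: assumes each dataset was loaded once
# beforehand and its journal saved as journals/<dataset-name>.jnl
# (the directory and naming scheme are ours, not Iguana's).
cd ../blazegraph
name=$(basename "$1" .rdf)               # e.g. ekaw-2012-complete-alignments
cp "journals/$name.jnl" blazegraph.jnl   # swap in the pre-loaded journal
java -Xmx4g -server -jar blazegraph.jar &
```

This avoids re-parsing the RDF file on every task at the cost of preparing and storing one journal per dataset.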
Now edit `stop.sh` and add the following:

```sh
pkill -f blazegraph
```

Be aware that this kills all Blazegraph instances, so make sure that no other process whose command line contains the word "blazegraph" is running.
Finally, change back into the working directory:

```sh
cd ..
```
### Fuseki
Now the same for Fuseki:

```sh
cd fuseki
touch load-and-start.sh
touch stop.sh
```
The `load-and-start.sh` script will load the dataset into a TDB directory and start Fuseki on top of that directory. Edit `load-and-start.sh` as follows:

```sh
cd ../fuseki

# load the dataset into a TDB directory
apache-jena-3.16.0/bin/tdbloader2 --loc DB "$1"

# start fuseki
apache-jena-fuseki-3.16.0/fuseki-server --loc DB /ds &
```
To assure fairness and provide Fuseki with 4 GB as well, edit `apache-jena-fuseki-3.16.0/fuseki-server` and, near the end of the file, change

```sh
JVM_ARGS=${JVM_ARGS:--Xmx1200M}
```

to

```sh
JVM_ARGS=${JVM_ARGS:--Xmx4G}
```
Now edit `stop.sh` and add the following:

```sh
pkill -f fuseki
```

Be aware that this kills all Fuseki instances, so make sure that no other process whose command line contains the word "fuseki" is running.

Finally, change back into the working directory:

```sh
cd ..
```
## Benchmark queries
We need some queries to benchmark. For now we will just use three simple queries:

```sparql
SELECT * {?s ?p ?o}
SELECT * {?s ?p ?o} LIMIT 10
SELECT * {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}
```

Save these to `queries.txt`, one query per line.
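One way to create the file directly from the shell (a minimal sketch):

```sh
# write the three benchmark queries to queries.txt, one per line
cat > queries.txt <<'EOF'
SELECT * {?s ?p ?o}
SELECT * {?s ?p ?o} LIMIT 10
SELECT * {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}
EOF
```

The quoted `'EOF'` delimiter keeps the heredoc literal, so the `?variables` are written as-is.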
## Creating the Benchmark Configuration
Now let's create the Iguana benchmark configuration. Create a file called `benchmark-suite.yml`:

```sh
touch benchmark-suite.yml
```

Add the following subsections to this file, or simply copy the whole piece from the Full configuration section below. Be aware that the benchmark will be started one directory level below our working directory, so paths use `../` to reach the correct location.
### Datasets
We have two datasets, EKAW 2012 and ISWC 2010. Let's name them as such and set the file paths so that the script hooks can use them:

```yaml
datasets:
  - name: "ekaw-2012"
    file: "../datasets/ekaw-2012-complete-alignments.rdf"
  - name: "iswc-2010"
    file: "../datasets/iswc-2010-complete-alignments.rdf"
```
### Connections
We have two connections, Blazegraph and Fuseki, with their respective endpoints:

```yaml
connections:
  - name: "blazegraph"
    endpoint: "http://localhost:9999/blazegraph/sparql"
  - name: "fuseki"
    endpoint: "http://localhost:3030/ds/sparql"
```
### Task script hooks
To assure that the correct triple store is loaded with the correct dataset, add the pre-script hook `../{{connection}}/load-and-start.sh {{dataset.file}}`. Here `{{connection}}` will be set to the name of the connection currently being benchmarked (e.g. `fuseki`) and `{{dataset.file}}` to the current dataset's file path. For example, the start script of Fuseki is located at `fuseki/load-and-start.sh`.

Further, add the `stop.sh` script as the post-script hook, assuring that the store is stopped after each task.

This will look like this:

```yaml
pre-script-hook: "../{{connection}}/load-and-start.sh {{dataset.file}}"
post-script-hook: "../{{connection}}/stop.sh"
```
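To illustrate the substitution: for the `fuseki` connection paired with the `ekaw-2012` dataset, the hooks would effectively expand to the commands below (shown for illustration only; Iguana runs them itself):

```sh
# pre-script hook after placeholder substitution
../fuseki/load-and-start.sh ../datasets/ekaw-2012-complete-alignments.rdf
# ... the benchmark task runs here ...
# post-script hook after placeholder substitution
../fuseki/stop.sh
```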
### Task configuration
We want to stresstest our stores for 10 minutes (600,000 ms) per dataset-connection pair. We are using plain text queries (`InstancesQueryHandler`) and want two simulated users issuing SPARQL queries. The queries file is located in our working directory as `queries.txt`. Be aware that Iguana is started one level below, which makes the correct path `../queries.txt`.

To achieve these restrictions, add the following to your file:

```yaml
tasks:
  - className: "Stresstest"
    configuration:
      timeLimit: 600000
      queryHandler:
        className: "InstancesQueryHandler"
      workers:
        - threads: 2
          className: "SPARQLWorker"
          queriesFile: "../queries.txt"
```
### Result Storage
Let's store the results as an N-Triples file and, to keep this tutorial simple, write them to the file `my-first-iguana-results.nt`. Add the following to do this:

```yaml
storages:
  - className: "NTFileStorage"
    configuration:
      fileName: "my-first-iguana-results.nt"
```
### Full configuration
```yaml
datasets:
  - name: "ekaw-2012"
    file: "../datasets/ekaw-2012-complete-alignments.rdf"
  - name: "iswc-2010"
    file: "../datasets/iswc-2010-complete-alignments.rdf"

connections:
  - name: "blazegraph"
    endpoint: "http://localhost:9999/blazegraph/sparql"
  - name: "fuseki"
    endpoint: "http://localhost:3030/ds/sparql"

pre-script-hook: "../{{connection}}/load-and-start.sh {{dataset.file}}"
post-script-hook: "../{{connection}}/stop.sh"

tasks:
  - className: "Stresstest"
    configuration:
      timeLimit: 600000
      queryHandler:
        className: "InstancesQueryHandler"
      workers:
        - threads: 2
          className: "SPARQLWorker"
          queriesFile: "../queries.txt"

storages:
  - className: "NTFileStorage"
    configuration:
      fileName: "my-first-iguana-results.nt"
```
## Starting the Benchmark
Simply use the previously created `benchmark-suite.yml` and start the benchmark with:

```sh
cd iguana/
./start-iguana.sh ../benchmark-suite.yml
```

Now we wait about 40 minutes (2 datasets × 2 connections × 10 minutes) until the benchmark is finished.
## Results
As previously configured, our results will be written to `my-first-iguana-results.nt`. Load this file into a triple store of your choice and query for the results you want to use. Using Blazegraph, for example:

```sh
cd blazegraph
./load-and-start.sh ../my-first-iguana-results.nt
```

To query the results, go to http://localhost:9999/blazegraph/.
An example:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iprop: <http://iguana-benchmark.eu/properties/>
PREFIX iont: <http://iguana-benchmark.eu/class/>
PREFIX ires: <http://iguana-benchmark.eu/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?taskID ?datasetLabel ?connectionLabel ?noq {
  ?suiteID rdf:type iont:Suite .
  ?suiteID iprop:experiment ?expID .
  ?expID iprop:dataset ?dataset .
  ?dataset rdfs:label ?datasetLabel .
  ?expID iprop:task ?taskID .
  ?taskID iprop:connection ?connection .
  ?connection rdfs:label ?connectionLabel .
  ?taskID iprop:NoQ ?noq .
}
```
This will provide a list of all tasks, naming the dataset, the connection, and the number of queries (NoQ) that were executed successfully.

We will however not go into detail on how to read the results. This can be read at Benchmark Results.
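Besides `iprop:NoQ`, Iguana records further global metrics such as queries per hour. Assuming the result model attaches these to the task under properties like `iprop:NoQPH` and `iprop:QMPH` (an assumption; check your `.nt` file for the exact property names), a variation of the query above could compare throughput per connection:

```sparql
PREFIX iprop: <http://iguana-benchmark.eu/properties/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# throughput per task; iprop:NoQPH and iprop:QMPH are assumed property names
SELECT ?taskID ?connectionLabel ?noqph ?qmph {
  ?taskID iprop:connection ?connection .
  ?connection rdfs:label ?connectionLabel .
  ?taskID iprop:NoQPH ?noqph .
  ?taskID iprop:QMPH ?qmph .
}
```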