@NotThreadSafe public class SqlBasedUriCollector extends Object implements UriCollector, Closeable
An implementation of the UriCollector interface that is backed by a SQL database.

Modifier and Type | Class and Description |
---|---|
protected static class | SqlBasedUriCollector.UriTableStatus |
Modifier and Type | Field and Description |
---|---|
protected int | bufferSize |
protected static String | COUNT_URIS_QUERY |
protected static String | CREATE_TABLE_QUERY |
protected Connection | dbConnection |
private static int | DEFAULT_BUFFER_SIZE |
protected static String | DROP_TABLE_QUERY |
protected static String | INSERT_URI_QUERY_PART_1 |
protected static String | INSERT_URI_QUERY_PART_2 |
protected Map<String,SqlBasedUriCollector.UriTableStatus> | knownUris |
private static org.slf4j.Logger | LOGGER |
private static int | MAX_ALPHANUM_PART_OF_TABLE_NAME |
private static String | SELECT_TABLE_QUERY |
protected org.dice_research.squirrel.data.uri.serialize.Serializer | serializer |
private static Pattern | TABLE_NAME_GENERATE_REGEX |
private static String | TABLE_NAME_KEY |
private long | total_uris |
Constructor and Description |
---|
SqlBasedUriCollector(org.dice_research.squirrel.data.uri.serialize.Serializer serializer, String dbPath) |
Modifier and Type | Method and Description |
---|---|
void | addNewUri(org.dice_research.squirrel.data.uri.CrawleableUri uri, org.dice_research.squirrel.data.uri.CrawleableUri newUri) Adds the given new URI to the list of URIs collected for the given URI. |
void | addTriple(org.dice_research.squirrel.data.uri.CrawleableUri uri, org.apache.jena.graph.Triple triple) Adds the given triple to the list of URIs collected from the given URI. |
protected void | addUri(org.dice_research.squirrel.data.uri.CrawleableUri uri, org.apache.jena.graph.Node node) |
void | close() |
void | closeSinkForUri(org.dice_research.squirrel.data.uri.CrawleableUri uri) |
void | create(String dbPath) |
protected static String | generateTableName(String uri) Generates a table name based on the given URI. |
long | getSize() |
long | getSize(org.dice_research.squirrel.data.uri.CrawleableUri uri) Returns the total number of URIs that have been collected. |
protected static String | getTableName(org.dice_research.squirrel.data.uri.CrawleableUri uri) Retrieves the URI's table name from its properties or generates a new table name and adds it to the URI (using the "URI_COLLECTOR_TABLE_NAME" property). |
Iterator<byte[]> | getUris(org.dice_research.squirrel.data.uri.CrawleableUri uri) Returns an iterator over the serialized CrawleableUri instances that have been collected for the given URI. |
void | openSinkForUri(org.dice_research.squirrel.data.uri.CrawleableUri uri) |
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface UriCollector: addNewUri, addNewUri
private static final org.slf4j.Logger LOGGER
protected static final String COUNT_URIS_QUERY
protected static final String CREATE_TABLE_QUERY
protected static final String DROP_TABLE_QUERY
protected static final String INSERT_URI_QUERY_PART_1
protected static final String INSERT_URI_QUERY_PART_2
private static final String SELECT_TABLE_QUERY
private static final String TABLE_NAME_KEY
private static final int MAX_ALPHANUM_PART_OF_TABLE_NAME
private static final int DEFAULT_BUFFER_SIZE
private static final Pattern TABLE_NAME_GENERATE_REGEX
private long total_uris
protected Connection dbConnection
protected org.dice_research.squirrel.data.uri.serialize.Serializer serializer
protected int bufferSize
protected Map<String,SqlBasedUriCollector.UriTableStatus> knownUris
public SqlBasedUriCollector(org.dice_research.squirrel.data.uri.serialize.Serializer serializer, String dbPath) throws SQLException
Throws: SQLException
public void create(String dbPath)
public void openSinkForUri(org.dice_research.squirrel.data.uri.CrawleableUri uri)
Specified by: openSinkForUri in interface org.dice_research.squirrel.sink.SinkBase
public Iterator<byte[]> getUris(org.dice_research.squirrel.data.uri.CrawleableUri uri)
Description copied from interface: UriCollector
Returns an iterator over the serialized CrawleableUri instances that have been collected for the given URI.
Specified by: getUris in interface UriCollector
Parameters:
uri - The URI from which the returned serialized URIs have been collected.
Returns: An Iterator that iterates over the already serialized URIs that have been collected for the given URI.

public void addTriple(org.dice_research.squirrel.data.uri.CrawleableUri uri, org.apache.jena.graph.Triple triple)
Description copied from interface: UriCollector
Adds the given triple to the list of URIs collected from the given URI. Prefer the UriCollector.addNewUri(CrawleableUri, CrawleableUri) method instead, since it enables the addition of metadata to the collected URI.
Specified by: addTriple in interface UriCollector
Parameters:
uri - The URI from which the given triple has been collected.
triple - The triple that has been collected.

protected void addUri(org.dice_research.squirrel.data.uri.CrawleableUri uri, org.apache.jena.graph.Node node)
public void addNewUri(org.dice_research.squirrel.data.uri.CrawleableUri uri, org.dice_research.squirrel.data.uri.CrawleableUri newUri)
Description copied from interface: UriCollector
Adds the given new URI to the list of URIs collected for the given URI.
Specified by: addNewUri in interface UriCollector
Parameters:
uri - The URI from which the given new URI has been collected.
newUri - The new URI that has been collected.

public void closeSinkForUri(org.dice_research.squirrel.data.uri.CrawleableUri uri)
Specified by: closeSinkForUri in interface org.dice_research.squirrel.sink.SinkBase
public long getSize()

public long getSize(org.dice_research.squirrel.data.uri.CrawleableUri uri)
Description copied from interface: UriCollector
Returns the total number of URIs that have been collected.
Specified by: getSize in interface UriCollector
Parameters:
uri - The URI from which the returned serialized URIs have been collected.

public void close() throws IOException
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Throws: IOException
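protected static String getTableName(org.dice_research.squirrel.data.uri.CrawleableUri uri)
Retrieves the URI's table name from its properties or generates a new table name and adds it to the URI (using the "URI_COLLECTOR_TABLE_NAME" property).
Parameters:
uri - the URI for which a table name is needed.

protected static String generateTableName(String uri)
Generates a table name based on the given URI. If the alphanumeric part of the URI is longer than MAX_ALPHANUM_PART_OF_TABLE_NAME=30 characters, the exceeding part is cut off. After that, the hash value of the original URI is appended.
Parameters:
uri - the URI for which a table name has to be generated

Copyright © 2017–2020. All rights reserved.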
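The table-name handling described for getTableName and generateTableName can be illustrated with a minimal, self-contained sketch. It is not the Squirrel implementation: the class TableNameSketch, the regular expression, the use of String.hashCode() rendered as hexadecimal, and the plain Map standing in for the properties of a CrawleableUri are all assumptions made for illustration. Only the limit of 30 characters and the "URI_COLLECTOR_TABLE_NAME" property key come from the documentation above.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative stand-in; not the Squirrel implementation. */
final class TableNameSketch {
    // Mirrors the documented limit MAX_ALPHANUM_PART_OF_TABLE_NAME = 30.
    static final int MAX_ALPHANUM_PART = 30;
    // Property key named in the getTableName description.
    static final String TABLE_NAME_KEY = "URI_COLLECTOR_TABLE_NAME";
    // Assumption: every character that is not an ASCII letter or digit is dropped.
    static final Pattern NON_ALPHANUM = Pattern.compile("[^A-Za-z0-9]+");

    /** Builds a name from the truncated alphanumeric part plus a hash of the full URI. */
    static String generateTableName(String uri) {
        String alphanum = NON_ALPHANUM.matcher(uri).replaceAll("");
        if (alphanum.length() > MAX_ALPHANUM_PART) {
            // Cut off the part that exceeds the limit.
            alphanum = alphanum.substring(0, MAX_ALPHANUM_PART);
        }
        // Assumption: String.hashCode() in hex, so the name never contains a minus sign.
        return alphanum + Integer.toHexString(uri.hashCode());
    }

    /**
     * Returns the cached table name if the (hypothetical) property map holds one,
     * otherwise generates a name and stores it under TABLE_NAME_KEY.
     */
    static String getTableName(Map<String, Object> uriProperties, String uri) {
        Object cached = uriProperties.get(TABLE_NAME_KEY);
        if (cached instanceof String) {
            return (String) cached;
        }
        String generated = generateTableName(uri);
        uriProperties.put(TABLE_NAME_KEY, generated);
        return generated;
    }
}
```

In this sketch, calling getTableName twice with the same property map returns the same name, because the second call reads the cached property instead of generating a new one. Appending the hash of the full URI keeps two long URIs with the same first 30 alphanumeric characters from mapping to the same table in most cases.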