public class RobotsManagerImpl extends Object implements RobotsManager
Modifier and Type | Field and Description |
---|---|
private static long | DEFAULT_MIN_WAITING_TIME |
private long | defaultMinWaitingTime |
private crawlercommons.fetcher.http.BaseHttpFetcher | fetcher |
private InetAddress | lastIpAddress |
private static org.slf4j.Logger | LOGGER |
private crawlercommons.robots.BaseRobotsParser | parser |
private crawlercommons.robots.BaseRobotRules | robotRules |
private static String | ROBOTS_FILE_NAME |
Constructor and Description |
---|
RobotsManagerImpl(crawlercommons.fetcher.http.BaseHttpFetcher fetcher) |
RobotsManagerImpl(crawlercommons.fetcher.http.BaseHttpFetcher fetcher, crawlercommons.robots.BaseRobotsParser parser) |
Modifier and Type | Method and Description |
---|---|
long | getMinWaitingTime(org.dice_research.squirrel.data.uri.CrawleableUri curi) Returns the minimum time a crawler should wait before sending a new request to the given domain. |
protected crawlercommons.robots.BaseRobotRules | getRules(org.dice_research.squirrel.data.uri.CrawleableUri curi) |
boolean | isUriCrawlable(org.dice_research.squirrel.data.uri.CrawleableUri curi) Returns true if the robots.txt file does not forbid the crawling of that URI. |
void | setDefaultMinWaitingTime(long defaultMinWaitingTime) |
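A caller would typically combine the two public checks: skip a URI that robots.txt forbids, and otherwise delay the request by the minimum waiting time for that host. The following is only a minimal sketch of that pattern. `RobotsCheckSketch` and `PoliteSchedulerSketch` are hypothetical stand-ins (the real methods take a `CrawleableUri`, not a `java.net.URI`), and the time unit is assumed to be milliseconds, which this page does not state.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical stand-in for the two public checks of RobotsManager.
// The real methods take a CrawleableUri instead of java.net.URI.
interface RobotsCheckSketch {
    boolean isUriCrawlable(URI uri);

    long getMinWaitingTime(URI uri);
}

// Hypothetical scheduler that decides when a URI may be fetched.
final class PoliteSchedulerSketch {
    private final RobotsCheckSketch robots;
    // Scheduled time of the most recent request per host
    // (assumption: milliseconds).
    private final Map<String, Long> lastRequest = new HashMap<>();

    PoliteSchedulerSketch(RobotsCheckSketch robots) {
        this.robots = robots;
    }

    /**
     * Returns the earliest time at which the URI may be fetched, or an
     * empty value if robots.txt forbids it. The returned time is recorded
     * as the most recent request for the URI's host.
     */
    OptionalLong nextFetchTime(URI uri, long now) {
        if (!robots.isUriCrawlable(uri)) {
            return OptionalLong.empty();
        }
        long wait = robots.getMinWaitingTime(uri);
        Long last = lastRequest.get(uri.getHost());
        long earliest = (last == null) ? now : Math.max(now, last + wait);
        lastRequest.put(uri.getHost(), earliest);
        return OptionalLong.of(earliest);
    }
}
```

Keeping the timestamps per host rather than per URI matches the method description, which speaks of requests "to the given domain".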
private static final org.slf4j.Logger LOGGER
private static final String ROBOTS_FILE_NAME
private static final long DEFAULT_MIN_WAITING_TIME
private long defaultMinWaitingTime
private crawlercommons.fetcher.http.BaseHttpFetcher fetcher
private crawlercommons.robots.BaseRobotsParser parser
private InetAddress lastIpAddress
private crawlercommons.robots.BaseRobotRules robotRules
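The `ROBOTS_FILE_NAME` constant suggests that the class derives the location of a host's robots.txt file from the URI being crawled. Its value is not shown on this page, so the following is only a minimal sketch, assuming the conventional `/robots.txt` path at the root of the host. `RobotsUrlSketch` and `robotsUrlFor` are hypothetical names, not part of Squirrel or crawler-commons.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Squirrel or crawler-commons.
final class RobotsUrlSketch {

    // Assumption: the conventional location of the file at the host root.
    static final String ROBOTS_PATH = "/robots.txt";

    private RobotsUrlSketch() {
    }

    /**
     * Builds the robots.txt URI for the host of the given URI. Scheme and
     * authority (host and port) are kept; path, query and fragment are dropped.
     */
    static URI robotsUrlFor(URI uri) {
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            throw new IllegalArgumentException("URI needs a scheme and an authority: " + uri);
        }
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), ROBOTS_PATH, null, null);
        } catch (URISyntaxException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException(e);
        }
    }
}
```

For example, `RobotsUrlSketch.robotsUrlFor(URI.create("http://example.org:8080/a/b?x=1"))` yields `http://example.org:8080/robots.txt`.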
public RobotsManagerImpl(crawlercommons.fetcher.http.BaseHttpFetcher fetcher)
public RobotsManagerImpl(crawlercommons.fetcher.http.BaseHttpFetcher fetcher, crawlercommons.robots.BaseRobotsParser parser)
protected crawlercommons.robots.BaseRobotRules getRules(org.dice_research.squirrel.data.uri.CrawleableUri curi)
public boolean isUriCrawlable(org.dice_research.squirrel.data.uri.CrawleableUri curi)
Specified by: isUriCrawlable in interface RobotsManager
Parameters: curi - the URI that should be crawled
public long getMinWaitingTime(org.dice_research.squirrel.data.uri.CrawleableUri curi)
Specified by: getMinWaitingTime in interface RobotsManager
Parameters: curi - a URI containing the domain to which two or more requests should be sent
public void setDefaultMinWaitingTime(long defaultMinWaitingTime)
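The `DEFAULT_MIN_WAITING_TIME` constant, the `defaultMinWaitingTime` field and this setter together suggest that `getMinWaitingTime` falls back to a configurable default when robots.txt declares no crawl delay. This page does not confirm that behaviour, so the following is a sketch under that assumption only. `WaitingTimeSketch`, the `UNSET` sentinel and the value `100L` are all hypothetical.

```java
// Hypothetical sketch of a crawl-delay fallback; not the Squirrel implementation.
final class WaitingTimeSketch {

    // Assumption: sentinel meaning "robots.txt declared no crawl delay".
    static final long UNSET = Long.MIN_VALUE;

    // Assumption: placeholder value. The real constant's value is not
    // shown in this documentation.
    static final long DEFAULT_MIN_WAITING_TIME = 100L;

    private long defaultMinWaitingTime = DEFAULT_MIN_WAITING_TIME;

    void setDefaultMinWaitingTime(long defaultMinWaitingTime) {
        this.defaultMinWaitingTime = defaultMinWaitingTime;
    }

    /**
     * Returns the crawl delay taken from robots.txt if one was declared,
     * otherwise the configured default.
     */
    long minWaitingTime(long crawlDelayFromRobots) {
        return (crawlDelayFromRobots == UNSET) ? defaultMinWaitingTime : crawlDelayFromRobots;
    }
}
```

Under this reading, calling `setDefaultMinWaitingTime` only changes the result for hosts whose robots.txt declares no delay of its own.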
Copyright © 2017–2020. All rights reserved.