
Thread framework question (java)

  • 24-08-2014 5:40am
    #1
    Registered Users Posts: 6,240 ✭✭✭


    I have a homework task to complete before a technical interview next week, and I am slightly confused over the wording:
    Create a simple webcrawler which pings & crawls a number of websites (read from an xml file)
    The number of threads executing the crawler should remain the same until the threads are finished. e.g. if the crawler dies it should restart
    There is no need to cater for timeouts.

    So my logic is to have an ExecutorService with a fixed thread pool:
    ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
    

    Now I am slightly confused over the part in bold, and I am hoping that anyone with a better understanding than me can let me know.

    But my reading is: for each website, execute a crawl.
    If there are any failures, add them to a fail list and then recrawl until there are none left on the list (I can have a max-failure count just in case).

    so something like
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    List<URL> listOfURLs = getURLList(); // read from the XML file (a sketch of this is further down)
    List<CrawlResult> results = new ArrayList<CrawlResult>();
    int failureCount = 0;
    while (!listOfURLs.isEmpty() && failureCount < MAX_FAIL) {
        // a fresh pool each round, because a shut-down executor can't accept new tasks
        ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
        for (URL url : listOfURLs) {
            exec.execute(new MyRunnableCrawl(url));
            // MyRunnableCrawl will contain the crawl code and, if the crawl fails,
            // add the URL to a synchronised failedURL list (a sketch of this is further down too)
        }
        // tell the exec that after these URLs we are done
        exec.shutdown();
        try {
            exec.awaitTermination(60, TimeUnit.SECONDS); // no need to cater for timeout exceptions
        } catch (InterruptedException e) { e.printStackTrace(); }
        // reassign the list of URLs to crawl to the ones which have failed (if any)
        listOfURLs = getFailureList();
        failureCount++;
    }
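
    For the getURLList() part, here's a rough sketch of pulling the URLs out of an XML file with the standard DOM parser. The file layout (<sites><site>http://...</site></sites>), the file name and the class name are just assumptions for illustration, since the assignment doesn't say what the XML looks like:

    import java.io.File;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class UrlListReader {

        // Assumes the XML looks like <sites><site>http://example.com</site>...</sites>
        // and lives in sites.xml (both assumptions, adjust to the real file)
        public static List<URL> getURLList() throws Exception {
            List<URL> urls = new ArrayList<URL>();
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("sites.xml"));
            NodeList sites = doc.getElementsByTagName("site");
            for (int i = 0; i < sites.getLength(); i++) {
                urls.add(new URL(sites.item(i).getTextContent().trim()));
            }
            return urls;
        }
    }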
    
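    And a minimal sketch of what MyRunnableCrawl could look like, assuming the failed URLs live in a static synchronised list on the class itself (so the main loop would call MyRunnableCrawl.getFailureList()); the HEAD-request "ping" is just one way of doing it:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class MyRunnableCrawl implements Runnable {

        // shared, thread-safe list of URLs that failed during this round
        private static final List<URL> failedURLs =
                Collections.synchronizedList(new ArrayList<URL>());

        private final URL url;

        public MyRunnableCrawl(URL url) {
            this.url = url;
        }

        @Override
        public void run() {
            try {
                // "ping" the site with a HEAD request and check the response code
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("HEAD");
                if (conn.getResponseCode() >= 400) {
                    failedURLs.add(url);
                    return;
                }
                // ... the actual crawl of the page would go here ...
            } catch (IOException e) {
                failedURLs.add(url); // record the failure so the main loop can retry it
            }
        }

        // returns the failures from this round and clears them for the next one
        public static List<URL> getFailureList() {
            synchronized (failedURLs) {
                List<URL> copy = new ArrayList<URL>(failedURLs);
                failedURLs.clear();
                return copy;
            }
        }
    }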

    It's been six months since I last looked at Java (I was away travelling), so I might be a bit rusty in some areas. Any feedback is welcome.


Comments

  • Registered Users Posts: 159 ✭✭magooly


    When the crawler dies it should begin where it left off, not at the beginning of the list of sites again.

    Keep a record of each site processed in a synchronised file that all the threads write to when a site is done. On startup, load the contents of this file into a set, and for each element in your list of sites above, check whether it is already in the processed set before crawling it. A rough sketch is below.
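
    Roughly what I mean (the file name and class are just examples): every worker appends to one file through a synchronised method, and on startup the file is loaded into a set so already-crawled sites get skipped:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class ProcessedLog {

        private final Path file;
        private final Set<String> processed = new HashSet<String>();

        public ProcessedLog(String fileName) throws IOException {
            this.file = Paths.get(fileName);
            if (Files.exists(file)) {
                // load the sites finished by a previous run so they can be skipped
                processed.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }

        // true if this site was already crawled in an earlier run
        public synchronized boolean isDone(String url) {
            return processed.contains(url);
        }

        // workers call this when a site is done; append-and-close so progress
        // survives the process dying mid-run
        public synchronized void markDone(String url) throws IOException {
            processed.add(url);
            try (PrintWriter out = new PrintWriter(new FileWriter(file.toFile(), true))) {
                out.println(url);
            }
        }
    }

    Before submitting a URL to the pool, check isDone(url) and skip it; the worker calls markDone(url) once the site has been crawled.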

    Finally, wrap your crawler process in an external managing process, i.e. a shell loop on Linux (I dunno what the Windows equivalent is):
    (while true; do
        java -jar MyCrawler.jar
        sleep 30
    done) &

    The above will run infinitely, so maybe add some intelligence to your Java program to signal "done" to the managing process, perhaps through a file.
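
    For example (the marker file name is made up), the Java side could drop a flag file when the whole crawl has finished, and the shell loop above could test for it, e.g. while [ ! -f crawl.done ] instead of while true:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public final class DoneSignal {

        private DoneSignal() {}

        // called once the crawler has processed every site; the wrapper loop
        // sees the file and stops relaunching the jar
        public static void signalDone() {
            try {
                Files.createFile(Paths.get("crawl.done"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }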

