
Thread framework question (java)

  • 24-08-2014 5:40am
    #1
    Registered Users Posts: 6,240 ✭✭✭


    I have a homework task to complete before a technical interview next week, and I am slightly confused over the wording:
    Create a simple webcrawler which pings & crawls a number of websites (read from an xml file)
    The number of threads executing the crawler should remain the same until the threads are finished. e.g. if the crawler dies it should restart
    There is no need to cater for timeouts.

    So my logic is to have an ExecutorService with a fixed thread pool:
    ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
    

    Now I am slightly confused over the part in bold, and I am hoping that anyone with a better understanding than me can let me know.

    But my reading is: for each website, execute a crawl.
    If there are any failures, add them to a fail list and then recrawl until there are none left on the list (I can have a max-failure count just in case).

    so something like
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    List<URL> listOfURLs = getURLList(); // read from the XML file (a sketch of this is further down)
    List<CrawlResult> results = new ArrayList<CrawlResult>();
    int failureCount = 0;
    while (!listOfURLs.isEmpty() && failureCount < MAX_FAIL) {
        // a fresh pool each round, because a shut-down executor can't accept new tasks
        ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
        for (URL url : listOfURLs) {
            exec.execute(new MyRunnableCrawl(url));
            // MyRunnableCrawl will contain the crawl code and, if the crawl fails,
            // add the URL to a synchronised failedURL list (a sketch of this is further down too)
        }
        // tell the exec that after these URLs we are done
        exec.shutdown();
        try {
            exec.awaitTermination(60, TimeUnit.SECONDS); // no need to cater for timeout exceptions
        } catch (InterruptedException e) { e.printStackTrace(); }
        // reassign the list of URLs to crawl to the ones which have failed (if any)
        listOfURLs = getFailureList();
        failureCount++;
    }
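
    For the getURLList() part, here's a rough sketch of pulling the URLs out of an XML file with the standard DOM parser. The file layout (<sites><site>http://...</site></sites>), the file name and the class name are just assumptions for illustration, since the assignment doesn't say what the XML looks like:

    import java.io.File;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class UrlListReader {

        // Assumes the XML looks like <sites><site>http://example.com</site>...</sites>
        // and lives in sites.xml (both assumptions, adjust to the real file)
        public static List<URL> getURLList() throws Exception {
            List<URL> urls = new ArrayList<URL>();
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("sites.xml"));
            NodeList sites = doc.getElementsByTagName("site");
            for (int i = 0; i < sites.getLength(); i++) {
                urls.add(new URL(sites.item(i).getTextContent().trim()));
            }
            return urls;
        }
    }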
    
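    And a minimal sketch of what MyRunnableCrawl could look like, assuming the failed URLs live in a static synchronised list on the class itself (so the main loop would call MyRunnableCrawl.getFailureList()); the HEAD-request "ping" is just one way of doing it:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class MyRunnableCrawl implements Runnable {

        // shared, thread-safe list of URLs that failed during this round
        private static final List<URL> failedURLs =
                Collections.synchronizedList(new ArrayList<URL>());

        private final URL url;

        public MyRunnableCrawl(URL url) {
            this.url = url;
        }

        @Override
        public void run() {
            try {
                // "ping" the site with a HEAD request and check the response code
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("HEAD");
                if (conn.getResponseCode() >= 400) {
                    failedURLs.add(url);
                    return;
                }
                // ... the actual crawl of the page would go here ...
            } catch (IOException e) {
                failedURLs.add(url); // record the failure so the main loop can retry it
            }
        }

        // returns the failures from this round and clears them for the next one
        public static List<URL> getFailureList() {
            synchronized (failedURLs) {
                List<URL> copy = new ArrayList<URL>(failedURLs);
                failedURLs.clear();
                return copy;
            }
        }
    }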

    It's been six months since I last looked at Java (I was away travelling), so I might be a bit rusty in some areas. Any feedback is welcome.


Comments

  • Registered Users Posts: 159 ✭✭magooly


    When the crawler dies it should begin where it left off, not at the beginning of the list of sites again.

    Keep a record of each site processed in a synchronised file that all the threads write to when a site is done. On startup, load the contents of this file into a set, and for each element in your list of sites above, check whether it is already in the processed set before crawling it. A rough sketch is below.
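
    Roughly what I mean (the file name and class are just examples): every worker appends to one file through a synchronised method, and on startup the file is loaded into a set so already-crawled sites get skipped:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class ProcessedLog {

        private final Path file;
        private final Set<String> processed = new HashSet<String>();

        public ProcessedLog(String fileName) throws IOException {
            this.file = Paths.get(fileName);
            if (Files.exists(file)) {
                // load the sites finished by a previous run so they can be skipped
                processed.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }

        // true if this site was already crawled in an earlier run
        public synchronized boolean isDone(String url) {
            return processed.contains(url);
        }

        // workers call this when a site is done; append-and-close so progress
        // survives the process dying mid-run
        public synchronized void markDone(String url) throws IOException {
            processed.add(url);
            try (PrintWriter out = new PrintWriter(new FileWriter(file.toFile(), true))) {
                out.println(url);
            }
        }
    }

    Before submitting a URL to the pool, check isDone(url) and skip it; the worker calls markDone(url) once the site has been crawled.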

    Finally, wrap your crawler process in an external managing process, i.e. a shell loop on Linux (I dunno what the Windows equivalent is):
    (while true; do
        java -jar MyCrawler.jar
        sleep 30
    done) &

    The above will run infinitely, so maybe add some intelligence to your Java program to signal "done" to the managing process, perhaps through a file.
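
    For example (the marker file name is made up), the Java side could drop a flag file when the whole crawl has finished, and the shell loop above could test for it, e.g. while [ ! -f crawl.done ] instead of while true:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public final class DoneSignal {

        private DoneSignal() {}

        // called once the crawler has processed every site; the wrapper loop
        // sees the file and stops relaunching the jar
        public static void signalDone() {
            try {
                Files.createFile(Paths.get("crawl.done"));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }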

