For this Project 4 Crawl milestone, your project must extend the functionality from the Project v4.0 Webpage assignment to support a multithreaded web crawl that can add multiple pages to the inverted index.
You must complete the prerequisite assignments, including the previous Project v3.1 Build and Project v4.0 Webpage assignments, before beginning to work on this one.
Your main method must be placed in a class named Driver and must process the following additional command-line arguments:
-html [seed] must be modified such that it also enables multithreading, even if the -threads flag is not present. It must also change how links on web pages are processed.
See the “Link Processing” section below for details.
-crawl [total] may optionally be provided, where the -crawl flag indicates the next argument [total] is the total number of URLs to crawl when the -html flag is provided.
If the -crawl flag is not provided, or the [total] argument is not provided or is not a valid number, then the -html flag should download and process only 1 web page (the seed).
See the “Web Crawl” section below for how to determine what pages to crawl.
These are in addition to the command-line arguments from the previous Project v3.1 Build assignment.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
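As an illustration only, below is a minimal sketch of how the -html and -crawl flags might be handled, assuming a simple flag/value scan of the arguments; the WebCrawler class mentioned in the comments is a hypothetical stand-in for the actual crawl implementation, not the required design.

```java
import java.util.HashMap;
import java.util.Map;

public class Driver {
    public static void main(String[] args) {
        // Gather flag/value pairs; pairs may appear in any order, and a
        // flag may appear without a value. (This simple scan treats any
        // argument starting with '-' as a flag.)
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
                flags.put(args[i], hasValue ? args[i + 1] : null);
            }
        }

        if (flags.containsKey("-html")) {
            String seed = flags.get("-html");

            // Fall back to 1 page (the seed only) when -crawl is missing,
            // has no value, or its value is not a valid number. Note that
            // Integer.parseInt(null) also throws NumberFormatException,
            // so one catch block covers every fallback case.
            int total;
            try {
                total = Integer.parseInt(flags.get("-crawl"));
            } catch (NumberFormatException e) {
                total = 1;
            }

            System.out.println("Crawling up to " + total + " page(s) from seed " + seed);

            // The -html flag always enables multithreading, even when the
            // -threads flag is absent. WebCrawler is a hypothetical
            // stand-in for the multithreaded crawl implementation:
            // new WebCrawler(workQueue).crawl(seed, total);
        }
    }
}
```

Catching NumberFormatException here also keeps the fallback logic user-friendly: an invalid [total] silently defaults to 1 page rather than surfacing a stack trace.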
The way that web pages are processed from Project v4.0 Webpage must be modified such that:
Any head, style, script, noscript, and svg block elements are removed from the HTML before links are processed. Links are then found from each a anchor tag and href attribute within the remaining HTML content, in the order they are provided on the page.
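The full requirements are in the “Link Processing” section, but to illustrate the two steps above, here is a minimal sketch that first strips the ignored block elements and then collects href values in document order. The LinkFinder class name and the regular expressions are assumptions for this sketch (for example, only double-quoted href values are matched), not the required approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkFinder {
    // Removes the block elements that must be ignored before links are
    // found (case-insensitive, spanning multiple lines).
    private static final Pattern BLOCKS = Pattern.compile(
            "(?is)<(head|style|script|noscript|svg)\\b.*?</\\1\\s*>");

    // Matches the href attribute of each a anchor tag; this simplified
    // pattern only handles double-quoted attribute values.
    private static final Pattern ANCHORS = Pattern.compile(
            "(?is)<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"");

    /** Returns href values in the order they appear in the cleaned HTML. */
    public static List<String> listHrefs(String html) {
        String cleaned = BLOCKS.matcher(html).replaceAll(" ");
        List<String> links = new ArrayList<>();
        Matcher matcher = ANCHORS.matcher(cleaned);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}
```

Stripping the blocks before matching anchors keeps links that appear inside script, svg, or other ignored content from being added to the crawl.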