For this Project 4 Crawl milestone, your project must maintain the functionality from the previous Project v3.1 Build assignment and create a web crawler that can add a single web page to the inverted index.
You must complete the previous project assignments, including the Project v3.1 Build assignment, before beginning to work on this one.
Your main method must be placed in a class named Driver and must process the following additional command-line arguments:
-html [seed] where the flag -html indicates the next argument [seed] is the seed URI the web crawler should download and process to build the index. See the “HTML Processing” section for how the download and processing must be completed.
If the -html flag is present, assume multithreading is enabled as if the -threads flag was also provided.
In other words, both the -threads and -html flags will trigger multithreaded classes to be initialized. If the -threads flag is not present, use the default number of threads to initialize the work queue.
These are in addition to the command-line arguments from the previous Project v3.1 Build assignment.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
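To make the flag handling above concrete, here is a minimal sketch of how a Driver might detect the -html flag, treat it as enabling multithreading, and fall back to a default thread count. The parsing approach, the placeholder default of 5 threads, and the commented-out initialization steps are illustrative assumptions, not requirements of this assignment.

```java
import java.util.HashMap;
import java.util.Map;

public class Driver {
    /** Assumed default worker thread count; use whatever default your course spec requires. */
    private static final int DEFAULT_THREADS = 5;

    public static void main(String[] args) {
        // Collect flag/value pairs in any order; flags start with '-' and values are optional.
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                String value = (i + 1 < args.length && !args[i + 1].startsWith("-")) ? args[i + 1] : null;
                flags.put(args[i], value);
            }
        }

        // The -html flag alone implies multithreading, even when -threads is absent.
        boolean multithreaded = flags.containsKey("-threads") || flags.containsKey("-html");

        int threads = DEFAULT_THREADS;
        if (flags.get("-threads") != null) {
            try {
                threads = Integer.parseInt(flags.get("-threads"));
            } catch (NumberFormatException e) {
                threads = DEFAULT_THREADS; // user-friendly fallback, never print a stack trace
            }
        }

        if (multithreaded) {
            System.out.println("Initializing thread-safe index with " + threads + " worker threads.");
            // ... initialize the thread-safe inverted index, work queue, and web crawler here ...
        } else {
            System.out.println("Initializing single-threaded index.");
            // ... initialize the single-threaded inverted index here ...
        }

        if (flags.get("-html") != null) {
            String seed = flags.get("-html");
            System.out.println("Crawling seed URI: " + seed);
            // ... download and process the seed page, then add its content to the index ...
        }
    }
}
```

Note how the presence of -html is enough to trigger the multithreaded classes, matching the rule above, while -threads only changes the number of workers in the work queue.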
Web pages must be requested from the web server using sockets and HTTP/S as follows (see the sketch after this list):
- If the web page has a 200 OK HTTP/S response status code and an HTML content type, then download, process, and add the HTML content to the inverted index.
- If the response status code is a redirect, follow the redirect until a 200 OK is returned. Associate the final response with the original cleaned URI and process it. For example, the URI ~cs212/redirect/one eventually redirects to ~cs212/simple/hello.html. The web crawler will associate the HTTP/S response of ~cs212/simple/hello.html with the original URI ~cs212/redirect/one when processing.
- For efficiency (and to avoid being blocked or rate-limited by the web server), do not download unnecessary content, and only download necessary content exactly once from the web server. Specifically:
  - Only download the page content when the response has a 200 OK status code and an HTML content type. For example, only the headers (not the content) will be downloaded for a large text file without the text/html content type, or for a 404 status web page.
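Below is a hedged sketch of how a crawler might issue an HTTP/S request over a raw socket and read only the status line and headers, which is enough to check the status code, the content type, and any redirect Location before deciding whether to read the body. The class name HttpFetcherSketch and the exact request format are illustrative assumptions, not the required implementation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.net.ssl.SSLSocketFactory;

public class HttpFetcherSketch {
    /**
     * Fetches only the status line and headers for a URI over HTTP/S sockets.
     * The body is intentionally not read here, so no unnecessary content is downloaded.
     */
    public static Map<String, String> fetchHeaders(URI uri) throws Exception {
        boolean https = "https".equalsIgnoreCase(uri.getScheme());
        int port = uri.getPort() != -1 ? uri.getPort() : (https ? 443 : 80);

        try (Socket socket = https
                ? SSLSocketFactory.getDefault().createSocket(uri.getHost(), port)
                : new Socket(uri.getHost(), port);
             PrintWriter request = new PrintWriter(socket.getOutputStream(), true, StandardCharsets.UTF_8);
             BufferedReader response = new BufferedReader(
                 new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {

            String path = (uri.getRawPath() == null || uri.getRawPath().isEmpty()) ? "/" : uri.getRawPath();

            // Send a minimal GET request; Connection: close so the server ends the stream.
            request.printf("GET %s HTTP/1.1\r\n", path);
            request.printf("Host: %s\r\n", uri.getHost());
            request.printf("Connection: close\r\n\r\n");

            Map<String, String> headers = new HashMap<>();
            String statusLine = response.readLine(); // e.g. "HTTP/1.1 200 OK"
            if (statusLine != null) {
                headers.put("Status", statusLine);
            }

            // Read headers only; stop at the blank line that separates headers from the body.
            String line;
            while ((line = response.readLine()) != null && !line.isBlank()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            return headers;
        }
    }
}
```

A caller would check whether the Status entry contains 200 OK and the Content-Type header starts with text/html before reading HTML content, or follow the Location header when the status code indicates a redirect. This sketch stops at the headers for simplicity; a complete crawler would keep reading the same response stream for the HTML body when the headers qualify, so each necessary page is downloaded exactly once.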