For this Project 4 Crawl milestone, your project must maintain the functionality from the previous Project v3.1 Build assignment and create a web crawler that can add a single web page to the inverted index.
You must complete the previous project assignments, including the Project v3.1 Build assignment, before beginning to work on this one.
Your main method must be placed in a class named Driver and must process the following additional command-line arguments:
-html [seed] where the flag -html indicates the next argument [seed] is the seed URI the web crawler should download and process to build the index. See the “HTML Processing” section for how the download and processing must be completed.
If the -html flag is present, assume multithreading is enabled as if the -threads flag was also provided.
In other words, both the -threads and -html flags will trigger multithreaded classes to be initialized. If the -threads flag is not present, use the default number of threads to initialize the work queue.
These are in addition to the command-line arguments from the previous Project v3.1 Build assignment.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
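To make the flag handling above concrete, here is a minimal sketch of how a Driver might detect the -html flag, treat it as enabling multithreading, and fall back to a default thread count. The parsing approach, the placeholder default of 5 threads, and the commented-out initialization steps are illustrative assumptions, not requirements of this assignment.

```java
import java.util.HashMap;
import java.util.Map;

public class Driver {
    /** Assumed default worker thread count; use whatever default your course spec requires. */
    private static final int DEFAULT_THREADS = 5;

    public static void main(String[] args) {
        // Collect flag/value pairs in any order; flags start with '-' and values are optional.
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                String value = (i + 1 < args.length && !args[i + 1].startsWith("-")) ? args[i + 1] : null;
                flags.put(args[i], value);
            }
        }

        // The -html flag alone implies multithreading, even when -threads is absent.
        boolean multithreaded = flags.containsKey("-threads") || flags.containsKey("-html");

        int threads = DEFAULT_THREADS;
        if (flags.get("-threads") != null) {
            try {
                threads = Integer.parseInt(flags.get("-threads"));
            } catch (NumberFormatException e) {
                threads = DEFAULT_THREADS; // user-friendly fallback, never print a stack trace
            }
        }

        if (multithreaded) {
            System.out.println("Initializing thread-safe index with " + threads + " worker threads.");
            // ... initialize the thread-safe inverted index, work queue, and web crawler here ...
        } else {
            System.out.println("Initializing single-threaded index.");
            // ... initialize the single-threaded inverted index here ...
        }

        if (flags.get("-html") != null) {
            String seed = flags.get("-html");
            System.out.println("Crawling seed URI: " + seed);
            // ... download and process the seed page, then add its content to the index ...
        }
    }
}
```

Note how the presence of -html is enough to trigger the multithreaded classes, matching the rule above, while -threads only changes the number of workers in the work queue.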
Web pages must be requested from the web server using sockets and HTTP/S as follows (see the sketch after this list):
- If the web page has a 200 OK HTTP/S response status code and an HTML content type, then download, process, and add the HTML content to the inverted index.
- If the response status code is a redirect, follow the redirect until a 200 OK is returned. Associate the final response with the original cleaned URI and process it. For example, the URI ~cs212/redirect/one eventually redirects to ~cs212/simple/hello.html. The web crawler will associate the HTTP/S response of ~cs212/simple/hello.html with the original URI ~cs212/redirect/one when processing.
- For efficiency (and to avoid being blocked or rate-limited by the web server), do not download unnecessary content, and only download necessary content exactly once from the web server. Specifically:
  - Only download the page content when the response has a 200 OK status code and an HTML content type. For example, only the headers (not the content) will be downloaded for a large text file without the text/html content type, or for a 404 status web page.
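Below is a hedged sketch of how a crawler might issue an HTTP/S request over a raw socket and read only the status line and headers, which is enough to check the status code, the content type, and any redirect Location before deciding whether to read the body. The class name HttpFetcherSketch and the exact request format are illustrative assumptions, not the required implementation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.net.ssl.SSLSocketFactory;

public class HttpFetcherSketch {
    /**
     * Fetches only the status line and headers for a URI over HTTP/S sockets.
     * The body is intentionally not read here, so no unnecessary content is downloaded.
     */
    public static Map<String, String> fetchHeaders(URI uri) throws Exception {
        boolean https = "https".equalsIgnoreCase(uri.getScheme());
        int port = uri.getPort() != -1 ? uri.getPort() : (https ? 443 : 80);

        try (Socket socket = https
                ? SSLSocketFactory.getDefault().createSocket(uri.getHost(), port)
                : new Socket(uri.getHost(), port);
             PrintWriter request = new PrintWriter(socket.getOutputStream(), true, StandardCharsets.UTF_8);
             BufferedReader response = new BufferedReader(
                 new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {

            String path = (uri.getRawPath() == null || uri.getRawPath().isEmpty()) ? "/" : uri.getRawPath();

            // Send a minimal GET request; Connection: close so the server ends the stream.
            request.printf("GET %s HTTP/1.1\r\n", path);
            request.printf("Host: %s\r\n", uri.getHost());
            request.printf("Connection: close\r\n\r\n");

            Map<String, String> headers = new HashMap<>();
            String statusLine = response.readLine(); // e.g. "HTTP/1.1 200 OK"
            if (statusLine != null) {
                headers.put("Status", statusLine);
            }

            // Read headers only; stop at the blank line that separates headers from the body.
            String line;
            while ((line = response.readLine()) != null && !line.isBlank()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            return headers;
        }
    }
}
```

A caller would check whether the Status entry contains 200 OK and the Content-Type header starts with text/html before reading HTML content, or follow the Location header when the status code indicates a redirect. This sketch stops at the headers for simplicity; a complete crawler would keep reading the same response stream for the HTML body when the headers qualify, so each necessary page is downloaded exactly once.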