For this Project 4 Crawl milestone, your project must extend the functionality from the Project v4.0 Webpage assignment to support a multithreaded web crawl that can add multiple pages to the inverted index.
You must complete the prerequisite assignments, including the previous Project v3.1 Build and Project v4.0 Webpage assignments, before beginning to work on this one.
Your main method must be placed in a class named Driver and must process the following additional command-line arguments:
-html [seed] must be modified such that it also enables multithreading, even if the -threads flag is not present. It must also change how links on web pages are processed.
See the “Link Processing” section below for details.
-crawl [total] may optionally be provided, where the -crawl flag indicates the next argument [total] is the total number of URLs to crawl when the -html flag is provided.
If the -crawl flag is not provided, or the [total] argument is not provided or is not a valid number, then the -html flag should download and process only 1 web page (the seed).
See the “Web Crawl” section below for how to determine what pages to crawl.
These are in addition to the command-line arguments from the previous Project v3.1 Build assignment.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
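As an illustration only, below is a minimal sketch of how the -html and -crawl flags might be handled, assuming a simple flag/value scan of the arguments; the WebCrawler class mentioned in the comments is a hypothetical stand-in for the actual crawl implementation, not the required design.

```java
import java.util.HashMap;
import java.util.Map;

public class Driver {
    public static void main(String[] args) {
        // Gather flag/value pairs; pairs may appear in any order, and a
        // flag may appear without a value. (This simple scan treats any
        // argument starting with '-' as a flag.)
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
                flags.put(args[i], hasValue ? args[i + 1] : null);
            }
        }

        if (flags.containsKey("-html")) {
            String seed = flags.get("-html");

            // Fall back to 1 page (the seed only) when -crawl is missing,
            // has no value, or its value is not a valid number. Note that
            // Integer.parseInt(null) also throws NumberFormatException,
            // so one catch block covers every fallback case.
            int total;
            try {
                total = Integer.parseInt(flags.get("-crawl"));
            } catch (NumberFormatException e) {
                total = 1;
            }

            System.out.println("Crawling up to " + total + " page(s) from seed " + seed);

            // The -html flag always enables multithreading, even when the
            // -threads flag is absent. WebCrawler is a hypothetical
            // stand-in for the multithreaded crawl implementation:
            // new WebCrawler(workQueue).crawl(seed, total);
        }
    }
}
```

Catching NumberFormatException here also keeps the fallback logic user-friendly: an invalid [total] silently defaults to 1 page rather than surfacing a stack trace.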
The way that web pages are processed from Project v4.0 Webpage must be modified such that:
Any head, style, script, noscript, and svg block elements are removed from the HTML before links are processed. Links are then found from each a anchor tag and href attribute within the remaining HTML content, in the order they are provided on the page.
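The full requirements are in the “Link Processing” section, but to illustrate the two steps above, here is a minimal sketch that first strips the ignored block elements and then collects href values in document order. The LinkFinder class name and the regular expressions are assumptions for this sketch (for example, only double-quoted href values are matched), not the required approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkFinder {
    // Removes the block elements that must be ignored before links are
    // found (case-insensitive, spanning multiple lines).
    private static final Pattern BLOCKS = Pattern.compile(
            "(?is)<(head|style|script|noscript|svg)\\b.*?</\\1\\s*>");

    // Matches the href attribute of each a anchor tag; this simplified
    // pattern only handles double-quoted attribute values.
    private static final Pattern ANCHORS = Pattern.compile(
            "(?is)<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"");

    /** Returns href values in the order they appear in the cleaned HTML. */
    public static List<String> listHrefs(String html) {
        String cleaned = BLOCKS.matcher(html).replaceAll(" ");
        List<String> links = new ArrayList<>();
        Matcher matcher = ANCHORS.matcher(cleaned);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}
```

Stripping the blocks before matching anchors keeps links that appear inside script, svg, or other ignored content from being added to the crawl.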