This problem set is due Wednesday, September 14, at 11:59pm.
Work on your own for this homework. You may use any source you like (including other papers or textbooks), but if you use any source not discussed in class, you must cite it.
The famous webmail provider, TepidMail, has hired you to secure their webmail service. Your job is to design and implement a way to make it safe to view untrusted HTML emails.
TepidMail is a standard webmail service, so a TepidMail user can go to the TepidMail website and view their email using a web browser. Your job is to figure out how to safely display HTML emails. If an attacker sends an HTML email containing malicious HTML content to TepidMail users, we want to be sure this can't harm the TepidMail users or their machines.
You're going to write a sanitizing filter that TepidMail can invoke on the command line, like this:
./htmlfilter < untrustedemail.html > safeforviewing.html
TepidMail will run each HTML email through this filter before sending it to the user's browser to be displayed. (For instance, the TepidMail mailserver might automatically run this filter on every incoming email that contains HTML content; then when the recipient goes to view their email, the filtered HTML document might be shown in a frame.) You have two goals:
Viewing a filtered HTML document should be as harmless as viewing an ASCII text file with, say, /bin/more (even if an attacker supplies the entire contents of an ASCII email, viewing it with /bin/more cannot harm your machine). In particular, reading an email from someone malicious should not cause any lasting side effects to the TepidMail user's machine that persist after their web browser is closed; it should not leak any confidential information (e.g., the contents of files on the user's hard disk; or, information about what the user is viewing in another window with the same browser); and it should not endanger the integrity of the user's machine (e.g., we must not allow it to tamper with a different web document that the user is viewing in another window using the same browser).
A filtered HTML document should be safe to view in any browser that is likely to be used by a significant number of TepidMail users: let's say, IE6 and later (e.g., IE7, IE8, etc.), Firefox 3 and later, Safari 4.0 and later, and a recent version of Chrome. (These all have non-trivial market share. For no good reason, I've excluded mobile browsers.)
Your code should be robust: it shouldn't crash on any input. Since TepidMail is going to run your program on malicious inputs, it would be embarrassing if any input caused your filter to crash uncleanly.
Your scheme must not only be secure; it must also be verifiably secure. You will have to provide an argument why it is reasonable to believe that your filter achieves this goal. As much as possible, you should strive to provide positive evidence of security, not just absence of evidence of insecurity.
Your filter must also be useful. For instance, a filter that ignores its input and always outputs the empty HTML page is secure but not very useful. Thus, your solution should be at least minimally useful for viewing the textual content of HTML emails. Ideally, it would also be nice to see inline images. However, to simplify your life, I'll let you assume that you don't need to support other content (e.g., CSS, scripts, Flash animations, videos, etc.).
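To make the command-line interface and the two goals concrete, here is a minimal sketch of one deliberately conservative baseline. It assumes Python 3 (the assignment does not prescribe a language), and the names TextOnly, sanitize, and main are hypothetical. It discards all markup, keeps only the text, escapes it, emits ASCII only, and falls back to an empty page if parsing fails. Nothing here has been checked against the browsers listed above, and it does not support inline images, so treat it as a starting point for thinking about the problem rather than a reference solution.

```python
#!/usr/bin/env python3
"""Hypothetical text-only baseline for ./htmlfilter (a sketch, not a reference solution)."""
import html
import sys
from html.parser import HTMLParser

# Fixed wrapper; the explicit charset is an assumption meant to keep the
# browser from having to guess an encoding for the page.
PAGE_HEAD = ('<html><head><meta http-equiv="Content-Type" '
             'content="text/html; charset=utf-8"></head><body><pre>')
PAGE_TAIL = "</pre></body></html>\n"
EMPTY_PAGE = PAGE_HEAD + PAGE_TAIL


class TextOnly(HTMLParser):
    """Collects character data and ignores every tag and attribute."""

    SKIP = {"script", "style"}  # element content that is not readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def sanitize(raw: bytes) -> str:
    """Return an ASCII-only HTML page containing just the escaped text of raw."""
    try:
        parser = TextOnly()
        parser.feed(raw.decode("utf-8", errors="replace"))
        parser.close()
        text = "".join(parser.chunks)
    except Exception:
        # Fail closed: any parsing problem yields the empty page.
        return EMPTY_PAGE
    # Drop C0 control characters other than newline and tab (a cautious choice,
    # not something the assignment requires).
    text = "".join(c for c in text if c in "\n\t" or c >= " ")
    # Escape &, <, >, and both quote characters, then turn every non-ASCII
    # character into a numeric character reference.
    escaped = html.escape(text, quote=True)
    ascii_only = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return PAGE_HEAD + ascii_only + PAGE_TAIL


def main() -> None:
    """Filter standard input to standard output, as ./htmlfilter is invoked."""
    sys.stdout.write(sanitize(sys.stdin.buffer.read()))
```

Calling main() at the end of the script, marking the file executable, and naming it htmlfilter would let it be invoked as ./htmlfilter < untrustedemail.html > safeforviewing.html. The design choice in this sketch is to build the output from scratch out of escaped text, rather than to try to remove dangerous constructs from the input; that makes it easier to argue about what can and cannot appear in the output.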
I want you to come up with a design, implement it, document your basic architecture and security argument, and submit both the document and the code. Your submission should contain at least three files:
Package your submission as a tar file:
tar cf your-lastname.tar .
Then, email this file as an attachment to cs261hw1 at taverner.cs.berkeley.edu by the due date. I will be using automated scripts to run your programs, so please do follow the above framework. If it helps, here is reference code that demonstrates the required format: ref.tar.