Thursday, May 20, 2010

New Document Editor: HTML diffs

One of the goals of my project is to add revision control for Melange documents. Any revision control (RC) framework needs some kind of diff engine that finds the changes between two versions of a document and presents them to the user. This is pretty simple for plain text, but Melange stores documents as HTML.

When talking about HTML diffs, we should consider two cases:
  1. HTML as text: a bunch of tags, attributes, values and content.
  2. HTML as a rendered document (e.g. an image).
Another question is: what causes representation changes in a rendered HTML document?
  1. Tags (through the browser's default CSS or applied CSS).
  2. The class and id attributes (through applied CSS).
  3. Applied CSS (server-side, in-document or in-line).
  4. Style modification through the DOM.
Let's consider the first case: HTML as text. It seems straightforward, because changes can be tracked with textual diff engines. But it's not that simple. Several kinds of changes to the HTML don't affect how the document is rendered:
  1. Tags are changed, but CSS is the same:
    <h1>Hello, world!</h1>
    is changed to
    <h2>Hello, world!</h2>
    but the CSS is
    h1, h2 {font-size: 12px; font-style: normal;}
  2. Some pieces of HTML are rendered the same:
    <div class="alert">Hello, world!</div>
    and
    <div class="alert">
    <p>Hello, world!</p>
    </div>
  3. The class or id is changed, but the CSS applied to both values is the same:
    <div class="original"></div>
    is changed to
    <div class="changed"></div>
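
To illustrate the problem, here is a minimal sketch of a purely textual diff, using Python's standard difflib module. The helper name text_diff is my own and not part of Melange. It compares the two versions from the first example line by line and reports a change, even though with the CSS above both versions would look the same to the user.

```python
import difflib

def text_diff(old_html, new_html):
    """Return unified-diff lines for two HTML strings compared as plain text."""
    return list(difflib.unified_diff(
        old_html.splitlines(),
        new_html.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))

old = "<h1>Hello, world!</h1>"
new = "<h2>Hello, world!</h2>"

# The textual diff knows nothing about CSS, so it flags the tag change.
diff_lines = text_diff(old, new)
for line in diff_lines:
    print(line)
```

The diff contains a removed <h1> line and an added <h2> line. Whether that difference is visible to the user depends entirely on the CSS, which the textual diff never sees.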
The second case is HTML as an image, by which I mean the HTML with its CSS applied, as it is displayed to the user. Comparing the rendered images is the most faithful way of handling HTML diffs, because it catches exactly the changes the user can see. It can be done with several tools, one of which is the convert utility from ImageMagick. This approach is, however, a little tricky and CPU-intensive.

Let's return to HTML as text. If we can guarantee that the CSS renders different tags differently and that there are no class or id changes, then a textual diff is sufficient and we can focus on textual diffs for HTML. With TinyMCE (the default editor for Melange), all representation changes are made with tags. If there is no appropriate tag, the style is applied with a <span> tag or a chain of <span> tags.
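
The "no class or id changes" condition could be checked before trusting a textual diff. The following is only a sketch under my own assumptions: it uses the standard html.parser module, the names AttrCollector, class_id_list and class_or_id_changed are hypothetical, and it simply compares the class and id values in document order, ignoring elements that have neither.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect (class, id) pairs of all start tags that carry either attribute."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        pair = (attr_map.get("class"), attr_map.get("id"))
        if pair != (None, None):
            self.seen.append(pair)

def class_id_list(html):
    """Return the class/id pairs found in an HTML string, in document order."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    return collector.seen

def class_or_id_changed(old_html, new_html):
    """True if the two versions differ in their class or id values."""
    return class_id_list(old_html) != class_id_list(new_html)

print(class_or_id_changed('<div class="original"></div>',
                          '<div class="changed"></div>'))
```

If this check reports a change, the textual diff alone cannot tell whether the rendering changed, and the stylesheet would have to be consulted as well.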

Textual HTML diffs can be generated by several tools. One of them is HTML diff for Python. I'm now thinking about using it as a skeleton and trying to build a smarter engine on top of it with Beautiful Soup.
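
To give a rough idea of what a tag-aware textual diff might look like, here is a small sketch. It is not the HTML diff tool mentioned above, and to stay within the standard library it uses html.parser rather than Beautiful Soup. The names Tokenizer, tokenize and token_diff are my own. The idea is to split the HTML into tags and words and then diff the two token lists with difflib.SequenceMatcher, so that whitespace between tags is ignored and an inserted tag shows up as a single change.

```python
import difflib
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Split HTML into a flat list of tokens: whole tags and single words."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as written, attributes included.
        self.tokens.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> become a single token.
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.tokens.append("</%s>" % tag)

    def handle_data(self, data):
        # Split text on whitespace, so pure formatting whitespace vanishes.
        self.tokens.extend(data.split())

def tokenize(html):
    tokenizer = Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    return tokenizer.tokens

def token_diff(old_html, new_html):
    """Return (operation, old_tokens, new_tokens) for every changed region."""
    old, new = tokenize(old_html), tokenize(new_html)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [(op, old[i1:i2], new[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]

changes = token_diff("<p>Hello, world!</p>",
                     "<p>Hello, <b>brave</b> world!</p>")
print(changes)
```

Because the comparison works on tokens rather than lines, re-indenting the HTML produces no changes at all, which is closer to what the user actually sees.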
