Fixing the Word TOC Mess - A First Try

In my last post (A Hook into Stellent), I described how my organization created a hook into Stellent that runs an external script whenever a native document is converted, so that each converted document can be transformed. I also described the problem my organization was having with Stellent’s conversion of Microsoft Word tables of contents (TOCs). In this post, I describe our current solution to that problem. I’m sharing it as a concrete example of how we hook into Stellent, but what I describe here may soon be replaced by an improved solution.

This solution was meant to be somewhat modular, since we were expecting to have it solve multiple problems. Here is the content of our transform.rb script:

#!/usr/local/bin/ruby
sleep 10  # wait to be sure that Stellent is done copying the file
require 'ftools'
require 'rubygems'
require 'rubyful_soup'  # ensure that Rubyful Soup contains fix for maintaining character references
filename = ARGV[0]
if ! File.exist?(filename)  # give up if HTML file does not exist
 exit
end
dirname = File.dirname(File.expand_path(filename))
filename_original = "#{filename}.original"
filename_transformed = "#{filename}.transformed"

need_transform = false
if ! File.exist?(filename_original) or  # both .original and .transformed files must exist
   ! File.exist?(filename_transformed) or
   File.mtime(filename) > File.mtime(filename_original)  # and HTML file must be newer than .original file
 need_transform = true
end

if need_transform
 File.copy(filename, filename_original)  # make copy of HTML file as .original file
 File.utime(Time.now, File.mtime(filename), filename_original)  # maintain timestamp for .original copy

 soup = BeautifulSoup.new('', :maintain_entity_references => true)  # create a parse tree of the HTML
 File.open(filename_original, "r") do |file|
   file.each_line do |line|
     soup.feed(line)
   end
 end

 require 'fix_word_toc'  # "plugin" for fixing the Word TOC
 soup = TransformSCM::FixWordTOC.transform_rubyful_soup(soup)

 File.open(filename_transformed, "w") do |file|  # write the file out as .transformed
   file.puts soup.to_s
 end
 File.copy(filename_transformed, filename)  # overwrite the HTML file with .transformed file
end
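
The freshness check at the top of the script can be isolated as a small predicate. The following is a minimal sketch, not our production code: the method name needs_transform? and the temporary file names are made up for illustration, and it uses FileUtils.cp from the standard library because the ftools library used above was removed in Ruby 1.9.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: true when the HTML file should be (re)transformed,
# i.e. a companion file is missing or the HTML file is newer than .original.
def needs_transform?(filename)
  original    = "#{filename}.original"
  transformed = "#{filename}.transformed"
  return true unless File.exist?(original) && File.exist?(transformed)
  File.mtime(filename) > File.mtime(original)
end

Dir.mktmpdir do |dir|
  html = File.join(dir, 'doc.html')
  File.write(html, '<html></html>')
  puts needs_transform?(html)  # no .original or .transformed file yet

  FileUtils.cp(html, "#{html}.original")
  FileUtils.cp(html, "#{html}.transformed")
  # Give the .original copy the same modification time as the HTML file.
  File.utime(Time.now, File.mtime(html), "#{html}.original")
  puts needs_transform?(html)  # companions exist and the HTML file is not newer
end
```

Copying the modification time onto the .original file is what lets a later run tell whether Stellent has rewritten the HTML file since the last transformation.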

Transform.rb is the Ruby script that is called by killit-hook.rb. (If you don’t yet know Ruby, I recommend Programming Ruby, 2nd Ed. You can also Try Ruby! and may also enjoy reading Why’s (Poignant) Guide to Ruby.)

The initial idea for the transform.rb script was to handle whatever transformations were needed by first parsing the HTML file that was output by Stellent and creating a parse tree representation of the HTML, then passing the parse tree to various “plugins” that would perform some transformation, and finally overwriting Stellent’s HTML file with the transformed parse tree. The comments in the above code should provide an understanding of what’s happening within the script.
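
The plugin idea described above can be sketched as a chain of classes that each accept a parse tree and return it. This is only an illustration under stated assumptions: the module PipelineSketch and both plugin classes are hypothetical, and a plain string stands in for the parse tree so that the example runs without Rubyful Soup.

```ruby
# Hypothetical plugins: each takes a "tree" and returns the transformed tree.
# A String stands in for the parse tree to keep the sketch self-contained.
module PipelineSketch
  class StripTrailingSpaces
    def self.transform(tree)
      tree.gsub(/[ \t]+$/, '')  # remove spaces and tabs at the end of each line
    end
  end

  class CollapseBlankLines
    def self.transform(tree)
      tree.gsub(/\n{3,}/, "\n\n")  # squeeze runs of blank lines down to one
    end
  end

  # Plugins run in the order listed; adding a transformation means adding a class here.
  PLUGINS = [StripTrailingSpaces, CollapseBlankLines]

  def self.run(tree)
    PLUGINS.inject(tree) { |current, plugin| plugin.transform(current) }
  end
end

html = "<p>One</p>  \n\n\n\n<p>Two</p>\n"
puts PipelineSketch.run(html)
```

In transform.rb the list currently has a single entry, the fix_word_toc plugin, and the value passed along is the Rubyful Soup parse tree rather than a string.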

The HTML parser that I chose to use for this project is Rubyful Soup. Rubyful Soup does a great job parsing most any HTML that you throw at it. The transform.rb script uses Rubyful Soup to parse the HTML and then passes off the parse tree to each plugin (currently just the fix_word_toc plugin). We ran into three issues with Rubyful Soup:

  1. It does something strange with entity references (such as “&nbsp;”). We created a customized copy of the library that fixes this problem.
  2. It takes quite some time to parse large HTML documents, sometimes as long as a few minutes.
  3. It cannot manipulate the parse tree (not a problem for the fix_word_toc plugin, but it is an issue for another plugin that we’re developing).

Here’s the content of our fix_word_toc.rb script that is used by the transform.rb script:

module TransformSCM
 class FixWordTOC
   def self.transform_rubyful_soup(soup)
     soup.find_all('div') do |toc_tag|  # find divs
       if toc_tag['class'] == "TOC"     # of class 'TOC'
         toc_tag.children do |p_tag|                                 # for each child of the div
           next if ! p_tag.respond_to?('name') or p_tag.name != 'p'  # only if tag p
           tags = Array.new
           p_tag.recursive_children do |tag|  # grab each child of tag p
             tags.push tag
           end
           if tags.length > 0
             tag = tags.last  # the last child of tag p
             tag.sub!(/\s*[\w\d]+$/, '')  # strip the trailing page number (digits or letters), plus any preceding spaces
           end
         end
       end
     end
     return soup
   end
 end
end

The fix_word_toc.rb script does the real work in solving the Word TOC conversion problem. For it to work correctly, Dynamic Converter must have been configured to set the class of each DIV element surrounding one or more TOC entries to “TOC”. It expects each TOC entry to be contained within a separate paragraph. Each paragraph may be made up of multiple elements. Each TOC entry must end with a space and a page number (either an integer or string of letters, such as a Roman numeral). These requirements are based on what is currently produced by Stellent when converting a Word TOC.

The fix_word_toc.rb script operates by taking the parse tree and finding each DIV that is of class “TOC”. For each paragraph within these DIVs, it gathers up all of the elements within the paragraph and, in the last element, strips the trailing run of letters or digits along with any whitespace that precedes it.
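
To see what the substitution does to a single TOC entry, here is a small standalone sketch that applies the same regular expression to plain strings. The helper name strip_page_number and the sample entries are invented for illustration; the real script calls sub! on the last text node of each paragraph instead.

```ruby
# Same pattern as in fix_word_toc.rb: optional whitespace followed by a
# run of word characters anchored at the end of the line.
TOC_PAGE_NUMBER = /\s*[\w\d]+$/

# Hypothetical helper that returns a copy of the entry without its page number.
def strip_page_number(text)
  text.sub(TOC_PAGE_NUMBER, '')
end

puts strip_page_number("1. Introduction 3")   # Arabic page number
puts strip_page_number("Preface iv")          # Roman-numeral page number
```

Because `\w` already matches digits, the `\d` inside the character class is redundant, though harmless.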

So far, this solution has worked well. We expect all Word document TOCs to have page numbers (integer or Roman). If for some reason we start getting documents with TOCs that do not have page numbers, the last word of each section title in the TOC will be removed. But because of how Word represents TOCs and because of Stellent’s limited ability to manipulate these TOCs, we currently have no other solution for fixing Stellent’s HTML rendition of Word TOCs.
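
The failure mode described above is easy to reproduce, and one possible mitigation is a stricter pattern. The sketch below is an assumption on my part, not something we run: the constant STRICT_PAGE_NUMBER and both helper names are hypothetical, and the stricter pattern only accepts digits or letters that can appear in Roman numerals.

```ruby
CURRENT_PATTERN    = /\s*[\w\d]+$/                   # pattern used in fix_word_toc.rb
STRICT_PAGE_NUMBER = /\s+(?:\d+|[ivxlcdm]+)$/i       # hypothetical stricter variant

def strip_current(text)
  text.sub(CURRENT_PATTERN, '')
end

def strip_strict(text)
  text.sub(STRICT_PAGE_NUMBER, '')
end

entry = "Appendix A"         # a TOC entry with no page number
puts strip_current(entry)    # the last word of the title is lost
puts strip_strict(entry)     # the entry is left alone

puts strip_strict("Preface iv")
puts strip_strict("1. Introduction 3")
```

The stricter pattern is still a heuristic: a title whose last word happens to use only Roman-numeral letters, such as “Mix”, would still lose that word.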

As I mentioned earlier, we are planning to replace this solution. The new solution will use another HTML parser that allows for manipulation of the parse tree. Hopefully the new parser will also be speedier and actively supported.
