In my last post (A Hook into Stellent), I described how my organization created a hook into Stellent that runs an external script whenever a native document is converted, so that the script can transform each converted document. I also described the problem we were having with Stellent’s conversion of Microsoft Word tables of contents (TOCs). In this post, I describe our current solution to the Word TOC conversion problem. I’m sharing it as a specific example of how we hook into Stellent, although what I describe here may soon be replaced by an improved solution.
This solution was meant to be somewhat modular, since we were expecting to have it solve multiple problems. Here is the content of our transform.rb script:
#!/usr/local/bin/ruby
sleep 10 # wait to be sure that Stellent is done copying the file
require 'ftools'
require 'rubygems'
require 'rubyful_soup' # ensure that Rubyful Soup contains fix for maintaining character references
filename = ARGV[0]
if ! File.exist?(filename) # give up if HTML file does not exist
  exit
end
dirname = File.dirname(File.expand_path(filename))
filename_original = "#{filename}.original"
filename_transformed = "#{filename}.transformed"
need_transform = false
if ! File.exist?(filename_original) or # both .original and .transformed files must exist
   ! File.exist?(filename_transformed) or
   File.mtime(filename) > File.mtime(filename_original) # and HTML file must be newer than .original file
  need_transform = true
end
if need_transform
  File.copy(filename, filename_original) # make copy of HTML file as .original file
  File.utime(Time.now, File.mtime(filename), filename_original) # maintain timestamp for .original copy
  soup = BeautifulSoup.new('', :maintain_entity_references => true) # create a parse tree of the HTML
  File.open(filename_original, "r") do |file|
    file.each_line do |line|
      soup.feed(line)
    end
  end
  require 'fix_word_toc' # "plugin" for fixing the Word TOC
  soup = TransformSCM::FixWordTOC.transform_rubyful_soup(soup)
  File.open(filename_transformed, "w") do |file| # write the file out as .transformed
    file.puts soup.to_s
  end
  File.copy(filename_transformed, filename) # overwrite the HTML file with .transformed file
end
The transform.rb script is the Ruby script that is called by killit-hook.rb. (If you don’t yet know Ruby, I recommend Programming Ruby, 2nd Ed. You can also Try Ruby! and may also enjoy reading Why’s (Poignant) Guide to Ruby.)
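To make the "should we transform?" decision in transform.rb easier to see in isolation, here is a minimal sketch of the same check. The method names needs_transform? and snapshot_original are my own, not part of our script, and I'm assuming FileUtils (the standard-library successor to ftools) in place of File.copy plus File.utime.

```ruby
require 'fileutils'

# Returns true when the HTML file should be (re)transformed: either of the
# .original/.transformed copies is missing, or the HTML file is newer than
# the .original copy (meaning Stellent has regenerated it).
def needs_transform?(filename)
  original    = "#{filename}.original"
  transformed = "#{filename}.transformed"
  return true unless File.exist?(original) && File.exist?(transformed)
  File.mtime(filename) > File.mtime(original)
end

# Copies the HTML file to .original and keeps the source's modification
# time, so the "newer than .original" test above stays meaningful.
# Assumption: FileUtils.cp with preserve: true stands in for the
# File.copy + File.utime pair used in transform.rb.
def snapshot_original(filename)
  FileUtils.cp(filename, "#{filename}.original", preserve: true)
end
```

Because the .original copy carries the same timestamp as the HTML file it was copied from, a later run only sees a newer HTML file when Stellent has written it again.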
The initial idea for the transform.rb script was to handle whatever transformations were needed by first parsing the HTML file that was output by Stellent and creating a parse tree representation of the HTML, then passing the parse tree to various “plugins” that would perform some transformation, and finally overwriting Stellent’s HTML file with the transformed parse tree. The comments in the above code should provide an understanding of what’s happening within the script.
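The parse, transform, write idea generalizes to more than one plugin. Here is a rough sketch of how a chain of plugins could be applied in order. Everything in it is hypothetical: the TransformSketch module and both plugin classes are made up for illustration, and a plain string stands in for the Rubyful Soup parse tree so that the example runs without the library.

```ruby
module TransformSketch
  # Hypothetical plugin: removes trailing whitespace from every line.
  class StripTrailingWhitespace
    def self.transform(doc)
      doc.lines.map { |line| line.rstrip }.join("\n")
    end
  end

  # Hypothetical plugin: collapses runs of blank lines into one newline.
  class CollapseBlankLines
    def self.transform(doc)
      doc.gsub(/\n{2,}/, "\n")
    end
  end

  PLUGINS = [StripTrailingWhitespace, CollapseBlankLines]

  # Feeds the document through each plugin in turn; every plugin receives
  # the previous plugin's result and returns the (possibly changed) document.
  def self.run(doc, plugins = PLUGINS)
    plugins.inject(doc) { |current, plugin| plugin.transform(current) }
  end
end
```

Each plugin only has to accept a document and return one, which is the same contract that transform_rubyful_soup follows with the parse tree.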
The HTML parser that I chose to use for this project is Rubyful Soup. Rubyful Soup does a great job parsing most any HTML that you throw at it. The transform.rb script uses Rubyful Soup to parse the HTML and then passes off the parse tree to each plugin (currently just the fix_word_toc plugin). We ran into three issues with Rubyful Soup:
- It does something strange with entity references (such as “&nbsp;”). We created a customized copy of the library that fixes this problem.
- It takes quite some time to parse large HTML documents, sometimes as long as a few minutes.
- It cannot manipulate the parse tree (not a problem for the fix_word_toc plugin, but it is an issue for another plugin that we’re developing).
Here’s the content of our fix_word_toc.rb script that is used by the transform.rb script:
module TransformSCM
  class FixWordTOC
    def self.transform_rubyful_soup(soup)
      soup.find_all('div') do |toc_tag| # find divs
        if toc_tag['class'] == "TOC" # of class 'TOC'
          toc_tag.children do |p_tag| # for each child of the div
            next if ! p_tag.respond_to?('name') or p_tag.name != 'p' # only if tag p
            tags = Array.new
            p_tag.recursive_children do |tag| # grab each child of tag p
              tags.push tag
            end
            if tags.length > 0
              tag = tags.last # the last child of tag p
              tag.sub!(/\s*[\w\d]+$/, '') # gobble up the page number (digits or letters) at the end of string, plus any preceding spaces
            end
          end
        end
      end
      return soup
    end
  end
end
The fix_word_toc.rb script does the real work in solving the Word TOC conversion problem. For it to work correctly, Dynamic Converter must have been configured to set the class of each DIV element surrounding one or more TOC entries to “TOC”. It expects each TOC entry to be contained within a separate paragraph. Each paragraph may be made up of multiple elements. Each TOC entry must end with a space and a page number (either an integer or string of letters, such as a Roman numeral). These requirements are based on what is currently produced by Stellent when converting a Word TOC.
The fix_word_toc.rb script operates by taking the parse tree and finding each DIV that is of class “TOC”. For each paragraph within these DIVs, it gathers up all of the elements within the paragraph and, in the last element of the paragraph, strips out the trailing run of digits or letters (the page number) along with any whitespace that precedes it.
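To see what that substitution does on its own, here is a small sketch that applies the same regular expression to a few strings. The sample TOC entries are made up for illustration, and strip_page_number is a helper name of my own; the real script calls sub! directly on the last element in the parse tree.

```ruby
# Same pattern as in fix_word_toc.rb: optional whitespace followed by a
# run of word characters at the end of the line.
PAGE_NUMBER_SUFFIX = /\s*[\w\d]+$/

# Returns a copy of the text with the trailing page number removed.
def strip_page_number(text)
  text.sub(PAGE_NUMBER_SUFFIX, '')
end

# Hypothetical TOC entries, not taken from a real Stellent conversion.
["1. Introduction 3", "Preface iv", "Appendix A 112"].each do |entry|
  puts "#{entry.inspect} -> #{strip_page_number(entry).inspect}"
end
# "1. Introduction 3" -> "1. Introduction"
# "Preface iv" -> "Preface"
# "Appendix A 112" -> "Appendix A"
```

Since \w already matches digits, the \d inside the character class is redundant, but it does no harm.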
So far, this solution has worked well. We expect all Word document TOCs to have page numbers (integer or Roman). If for some reason we start getting documents with TOCs that do not have page numbers, the last word of each section title in the TOC will be removed. But because of how Word represents TOCs and because of Stellent’s limited ability to manipulate these TOCs, we currently have no other solution for fixing Stellent’s HTML rendition of Word TOCs.
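For what it's worth, the risk of eating the last word could probably be reduced with a stricter pattern. The sketch below is not what our script does; it is only one possible mitigation, and it assumes that page numbers are always either decimal digits or strings made up of Roman numeral letters. The names STRICT_PAGE_NUMBER and strip_page_number_strict are hypothetical.

```ruby
# Assumption: a page number is whitespace followed by either digits or a
# run of Roman numeral letters (upper or lower case) at the end of the line.
STRICT_PAGE_NUMBER = /\s+(?:\d+|[ivxlcdm]+)$/i

# Returns a copy of the text with a trailing page number removed, leaving
# entries that end in an ordinary word untouched.
def strip_page_number_strict(text)
  text.sub(STRICT_PAGE_NUMBER, '')
end

puts strip_page_number_strict("Overview 12")      # Overview
puts strip_page_number_strict("Preface xii")      # Preface
puts strip_page_number_strict("Getting Started")  # Getting Started
puts strip_page_number_strict("Audio mix")        # Audio
```

The last line shows that this is still a heuristic: a title whose final word happens to consist only of Roman numeral letters, such as "mix", would lose that word.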
As I mentioned earlier, we are planning to replace this solution. The new solution will use another HTML parser that allows for manipulation of the parse tree. Hopefully that parser will also be speedier and actively supported.