My organization uses Stellent Web Content Management for publishing content to the web. One issue that has eaten up quite a bit of our time is getting better control over the HTML that the system produces. The product that converts native documents (such as Microsoft Word documents) to HTML is called Stellent Dynamic Converter (DC). DC can use rules to manipulate the HTML document during conversion. We have version 7.5 of DC, which uses something called a GUI template to customize the conversion process. Unfortunately, GUI templates do not seem to be as full-featured as the script templates that were available prior to version 6.
Until recently, we coped with the limitations of DC. One limitation that we deemed unacceptable was DC’s treatment of Microsoft Word’s Tables of Contents (TOC). We expect the content management system (CMS) to manage our native documents and to allow them to be displayed in the native application, printed to paper, and displayed as web pages. The problem is that the TOC is used differently in each of these cases. When a document is viewed using the native application or printed from it, the TOC is used to find a section by its page number. When a document is viewed as a web page (or viewed using the native application), each TOC entry is used as a link to jump to the referenced section.
When a Word document contains a TOC with page numbers, DC renders the TOC with the page numbers. Not a show stopper, except that the page numbers are separated from the TOC entry titles with a single space. This causes confusing TOC entries such as “Introduction 1” and “Appendix A 12”. Stellent’s support group was unable to find an acceptable solution.
Luckily, our system administrator noticed some behavior that led to a hook (a.k.a. hack) into the conversion process.
Note: We’re using Site Studio and Dynamic Converter on a Red Hat Enterprise Linux server. Your mileage may vary.
The command that is executed during conversion is something like this:
/bin/sh /cs/shared/os/linux/lib/htmlexport/killit.sh /cs/shared/os/linux/lib/htmlexport/dcexport -c /cs/vault/~temp/htmlexport/943462647.hda -f /cs/weblayout/groups/public/@top/@topic/documents/content/~export/DEV01_123~6~DCT_A~temp0952/987.hcst 2>&1
We had known that dcexport is the executable that the system uses to convert documents from native format to HTML. The important find was that killit.sh, a shell script, is called and passed the full command for executing dcexport. I’m not a shell programmer, but from what I can make of this script, it simply sets up some traps to kill off the dcexport process after an exceptional condition occurs. Why is this script interesting? It provides a method for injecting our own script to transform the converted HTML. We were able to kick off our transformation script by replacing this line in killit.sh:
$@ &
with this line:
/cs/shared/os/linux/lib/htmlexport/killit-hook.rb "$@" &
The killit-hook.rb script (yes, it’s Ruby) contains:
#!/usr/local/bin/ruby
# Rebuild the command line that killit.sh passed in.
command = ARGV.join( " " )
dcexport = false
# For a dcexport call, capture its directory and the final (non-temporary) output filename.
if command =~ /^(.*)\/dcexport\s+-c\s+.*\s+-f\s+([^\s]+)~temp\d+\/([^\/]+\.hcst)/
  dcexport_path = $1
  html_filename = "#{$2}/#{$3}"
  dcexport = true
end
# Always run the original command and wait for it to finish.
`#{command}`
if dcexport
  # Run transform.rb in a forked child so that killit.sh does not kill it.
  exec("/usr/local/bin/ruby -I #{dcexport_path} #{dcexport_path}/transform.rb #{html_filename} &") if fork.nil?
  exit
end

This script accomplishes a couple of things. First, it always executes the command that is passed to it (see the backticks?). Second, if dcexport is the command, it executes transform.rb, another Ruby script, passing in the filename of the HTML file that dcexport has just created. The reason for the fork is to create a child process to keep killit.sh from killing off the transform.rb script.
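For readers who want to see the fork-and-exec idea on its own, here is a minimal sketch. It is not the code we run in production: run_detached is a hypothetical helper name, and the Process.setsid and Process.detach calls are additions of mine, made on the assumption that putting the child in its own session helps keep it out of reach of signals aimed at the parent. I have not verified how killit.sh actually delivers its signals.

```ruby
# Runs argv (an array: the program followed by its arguments) in a
# detached child process and returns the child's PID without waiting.
# run_detached is a hypothetical helper, not part of killit-hook.rb.
def run_detached(argv)
  pid = fork do
    Process.setsid  # assumption: a new session shields the child from the parent's group signals
    exec(*argv)     # replace the child process with the target program
  end
  Process.detach(pid)  # reap the child in the background so it does not linger as a zombie
  pid
end

# Harmless demonstration; the real hook would pass the ruby/transform.rb command instead.
if __FILE__ == $PROGRAM_NAME
  child = run_detached(["/bin/sh", "-c", "exit 0"])
  puts "started child #{child}"
end
```

Passing the command as an array also avoids re-splitting arguments on whitespace, which matters if a path ever contains a space.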
Notice that the filename of the HTML file is parsed to remove the temporary portion (~temp0952) of the name: the system writes to a temporary file and, when the conversion completes, renames it to drop the temporary portion.
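To make the filename handling concrete, here is a small sketch that applies the same regular expression to the sample command shown earlier. The constant and method names (DCEXPORT_PATTERN, parse_dcexport_command) are made up for illustration, and the sample string omits the killit.sh prefix and the 2>&1 redirect.

```ruby
# Same pattern as in killit-hook.rb.
DCEXPORT_PATTERN = /^(.*)\/dcexport\s+-c\s+.*\s+-f\s+([^\s]+)~temp\d+\/([^\/]+\.hcst)/

# Returns [dcexport_dir, final_html_filename] for a dcexport command line,
# or nil when the command is not a dcexport call.
# parse_dcexport_command is a hypothetical helper name.
def parse_dcexport_command(command)
  match = DCEXPORT_PATTERN.match(command)
  return nil unless match
  [match[1], "#{match[2]}/#{match[3]}"]
end

sample = '/cs/shared/os/linux/lib/htmlexport/dcexport ' \
         '-c /cs/vault/~temp/htmlexport/943462647.hda ' \
         '-f /cs/weblayout/groups/public/@top/@topic/documents/content/~export/DEV01_123~6~DCT_A~temp0952/987.hcst'

dir, html = parse_dcexport_command(sample)
puts dir   # /cs/shared/os/linux/lib/htmlexport
puts html  # /cs/weblayout/groups/public/@top/@topic/documents/content/~export/DEV01_123~6~DCT_A/987.hcst
```

The ~temp in /cs/vault/~temp/ is left alone because the pattern requires digits after ~temp and only looks at the path that follows -f.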
The transform.rb script does the work of fixing our Microsoft Word TOC to HTML conversion issue. But that’s a posting for another day.
Because we’re modifying undocumented scripts, rewriting temporary filenames to work out the final HTML filenames, and manipulating the final HTML files, this solution is brittle and may break when we upgrade our system. But it does solve our current issue, and we plan to use it to solve some other issues that have no official Stellent solution.