How We Converted 134 MS Word Documents with Very Special Requirements to DITA

As you probably know, one of the cool features of DITAToo is the ability to automatically convert even unstyled MS Word documents to DITA (if you are not yet aware of this feature, you can watch a video on Converting Word Documents to DITA).

In many cases, the out-of-the-box conversion is exactly what our customers need. However, some customers have very specific conversion requirements that it does not cover. We address these requirements by customizing our core conversion algorithm and developing custom conversion tools.

Today I want to tell you about a Word-to-DITA conversion project that we did for one of our customers. We received 134 MS Word documents, over 3,000 pages in total, that had to be converted to DITA.

The project requirements included:

  • Conversion of reusable portions of text to conrefs. The documentation is translated into around 30 languages. To reduce translation costs, all occurrences of user interface elements (for example, names of menu options, window titles, field labels, and button captions) had to be taken out of the original text and put into a separate file. From this file, they had to be pulled into topics via conrefs. When all user interface elements are kept in a single file, translators translate their names just once, and the translated labels and captions appear everywhere they occur.

    Additionally, this approach helps the customer quickly update user interface text in all places where it appears.

    The customer gave us several DITA files in which the conref sources were already defined. In the original Word documents, placeholders had been added and formatted with a specific style. Our task was to find each placeholder, look up the corresponding conref source, and replace the placeholder with the conref.

  • Automatic generation of links. The original Word documents contained a lot of references to other sections. However, they weren’t actual references, but rather plain text giving the names of the sections to which links should be generated.

    Because inline links usually interrupt the text flow and distract the reader’s attention, the customer wanted all links to be put at the end of the topic as related links. So our conversion tool had to find a “reference” in the original Word document, look through the other documents for a target using the name of the referenced section, and generate a link.

  • Automatic conversion of conditional content. The original Word documents contained a lot of conditional text. The conditional text was implemented by putting special marks at the beginning and at the end of each conditional piece of content. Originally, these marks were processed by the customer’s home-grown macro. Depending on the output, the macro could hide or expose the content between these marks. In DITA, this content was supposed to be conditionalized using conditional attributes.

    The problem was that in the legacy Word documents, conditional content could start at the beginning of one section and end in the middle of the next section. Or it could start somewhere in the middle of a section, completely cover the next few sections, and then end after the first paragraph of another section. This required implementing sophisticated logic that could correctly identify whether an entire DITA topic should be conditionalized in a map or only certain elements within the topic should be conditionalized.

  • Automatic identification of concepts and tasks. That was actually the easiest part. Our existing conversion mechanism can identify the information type of an original piece of content and convert it to either a concept or a task.
  • Assembling converted DITA topics into maps. As part of the conversion, a set of maps had to be created. Each map had to mirror the structure of the original Word document. As mentioned above, some of the topics had to be conditionalized in the map.
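
To make the conditional-content requirement above more concrete, here is a minimal sketch of the decision it describes. It is not our actual algorithm. It assumes a simplified model in which each section becomes one topic, a conditional range is given as inclusive (section index, paragraph index) positions, and the names Span and classify are made up for illustration. A section that is fully covered is marked for conditionalizing in the map; a partly covered section gets the list of paragraphs to conditionalize inside the topic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A conditional range (hypothetical model), inclusive at both ends.

    Positions are (section index, paragraph index) pairs, zero-based.
    """
    start: tuple[int, int]
    end: tuple[int, int]
    condition: str


def classify(section_lengths: list[int], span: Span) -> dict[int, object]:
    """For each affected section, return "map" if the whole topic is covered,
    otherwise the list of covered paragraph indexes."""
    result: dict[int, object] = {}
    for sec in range(span.start[0], span.end[0] + 1):
        count = section_lengths[sec]
        first = span.start[1] if sec == span.start[0] else 0
        last = span.end[1] if sec == span.end[0] else count - 1
        if first == 0 and last == count - 1:
            result[sec] = "map"  # condition goes on the topic reference in the map
        else:
            result[sec] = list(range(first, last + 1))  # condition goes on elements
    return result


# Four sections with 4, 3, 2, and 5 paragraphs. The range starts in the middle
# of section 0, covers sections 1 and 2, and ends after the first paragraph
# of section 3.
decision = classify([4, 3, 2, 5], Span((0, 2), (3, 0), "pro"))
print(decision)  # {0: [2, 3], 1: 'map', 2: 'map', 3: [0]}
```

In this model, the "map" entries would become conditional attributes on the corresponding topic references, and the paragraph lists would become conditional attributes on elements inside the topic.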

The conversion included three stages.

In stage 1, we just used our existing algorithm with a few modifications. The algorithm converted the Word documents to DITA and preserved the names of all styles in the outputclass attribute. We needed the style names for this project because styles were used to indicate the text to be replaced with a conref, references to other sections, and some other pieces of content that had to be handled in a special way.
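
Once the style names are available in outputclass, replacing a placeholder with a conref could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the style name UIPlaceholder, the file name ui_labels.dita, the use of ph elements, and the idea of matching a placeholder to its conref source by label text. The project's real matching rules are not shown here.

```python
import xml.etree.ElementTree as ET

UI_FILE = "ui_labels.dita"  # hypothetical name of the conref source file


def build_index(source_xml):
    """Return the source topic id and a map from label text to element id."""
    root = ET.fromstring(source_xml)
    index = {
        ph.text.strip(): ph.get("id")
        for ph in root.iter("ph")
        if ph.get("id") and ph.text
    }
    return root.get("id"), index


def apply_conrefs(topic_xml, topic_id, index, style="UIPlaceholder"):
    """Turn styled placeholders into conrefs; return new XML and unmatched labels."""
    root = ET.fromstring(topic_xml)
    missing = []
    for ph in root.iter("ph"):
        if ph.get("outputclass") != style:  # assumed style name
            continue
        label = (ph.text or "").strip()
        target = index.get(label)
        if target is None:
            missing.append(label)  # leave the placeholder as-is for the error log
            continue
        ph.set("conref", f"{UI_FILE}#{topic_id}/{target}")
        del ph.attrib["outputclass"]
        ph.text = None  # the text is now pulled from the conref source
    return ET.tostring(root, encoding="unicode"), missing


SOURCE = (
    '<concept id="ui_labels"><title>UI labels</title><conbody>'
    '<p><ph id="btn_save">Save</ph></p></conbody></concept>'
)
TOPIC = (
    '<task id="t1"><title>Saving</title><taskbody><context><p>Click '
    '<ph outputclass="UIPlaceholder">Save</ph> or '
    '<ph outputclass="UIPlaceholder">Cancel</ph>.</p></context></taskbody></task>'
)

topic_id, index = build_index(SOURCE)
converted, missing = apply_conrefs(TOPIC, topic_id, index)
print(missing)  # ['Cancel']
```

The unmatched label is returned rather than silently dropped, so it can be reported instead of producing a broken conref.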

In stage 2, the conversion tool that we developed based on our core algorithm processed the content using the values of the outputclass attributes. For example, if the conversion tool found outputclass="reference", it looked for a destination topic, generated a link, and put it under <related-links>.
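
The related-links step can be sketched in a few lines. This is an illustration under assumptions, not the tool itself: the function name add_related_links is invented, the index of section titles to topic files is assumed to have been built beforehand from all converted documents, and the inline text is simply left in place.

```python
import xml.etree.ElementTree as ET


def add_related_links(topic_xml, title_to_href):
    """Collect elements marked outputclass="reference" and append matching
    links under <related-links>. Returns the new XML and unresolved titles."""
    root = ET.fromstring(topic_xml)
    links = {}       # href -> title, keeps insertion order and avoids duplicates
    unresolved = []
    for element in root.iter():
        if element.get("outputclass") != "reference":
            continue
        title = "".join(element.itertext()).strip()
        href = title_to_href.get(title)
        if href is None:
            unresolved.append(title)  # target section not found anywhere
        else:
            links.setdefault(href, title)
    if links:
        related = ET.SubElement(root, "related-links")
        for href, title in links.items():
            link = ET.SubElement(related, "link", href=href)
            ET.SubElement(link, "linktext").text = title
    return ET.tostring(root, encoding="unicode"), unresolved


TOPIC = (
    '<concept id="c1"><title>Overview</title><conbody><p>See '
    '<ph outputclass="reference">Installing the Product</ph> and '
    '<ph outputclass="reference">Missing Section</ph>.</p></conbody></concept>'
)
# Hypothetical index from section titles to topic files.
TITLES = {"Installing the Product": "install.dita"}

linked, unresolved = add_related_links(TOPIC, TITLES)
print(unresolved)  # ['Missing Section']
```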

In parallel, the conversion tool automatically generated a log file with errors. For example, in many cases text was marked as a reference, but the target couldn’t be found anywhere in the original documents. Similarly, quite often the conref source just didn’t exist although a placeholder for this conref was in place. Or a closing conditional mark was missing, so there was no way to know where the conditional content ended. The log helped us easily identify each problem, report it to the customer, and find a solution.
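
As an example of the kind of check that feeds such a log, the sketch below looks for conditional marks that are never closed. The mark syntax ([[IF:name]] and [[ENDIF]]) is invented for this example, since the customer's real marks are not shown here, and the sketch assumes that marks may be nested.

```python
import re

# Hypothetical mark syntax: [[IF:name]] opens a condition, [[ENDIF]] closes it.
MARK = re.compile(r"\[\[IF:(\w+)\]\]|\[\[ENDIF\]\]")


def check_marks(paragraphs):
    """Return log messages for unbalanced conditional marks."""
    errors = []
    open_marks = []  # stack of (condition name, paragraph number)
    for number, text in enumerate(paragraphs, start=1):
        for token in MARK.finditer(text):
            if token.group(1):  # an opening mark
                open_marks.append((token.group(1), number))
            elif open_marks:    # a closing mark that matches an open one
                open_marks.pop()
            else:
                errors.append(
                    f"paragraph {number}: closing mark without an opening mark"
                )
    for condition, number in open_marks:
        errors.append(f"paragraph {number}: condition '{condition}' is never closed")
    return errors


log = check_marks(["[[IF:pro]]Pro only.", "More text.", "[[ENDIF]] [[IF:lite]] Lite only."])
print(log)  # ["paragraph 3: condition 'lite' is never closed"]
```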

Finally, in stage 3, we manually cleaned up the issues that couldn’t be resolved automatically. Fortunately, because our conversion tool handled most cases on its own, this part didn’t take much time.

As a result, over 6,000 topics and over 120 DITA maps were automatically created.

The project was full of challenges, but we did it, made our customer happy, and used what we’ve learned to improve the core conversion algorithm. So if you have legacy documents in Word and want to convert them to DITA, we’ll be happy to help!
