OpenXML & VSTO & VBA - Finding a reliable mechanism for reading the correct value of CharactersWithSpaces 'extended-properties' in Word documents [part 1/2].


 

This article is split across two blog posts and this is part #1; use this link to go to part #2.


There are many scenarios in which we need to know exactly how many words, pages, paragraphs or characters a document contains:

   > we might be developing a tool which searches through files; if we want to add OpenXML files to
      its list of known formats, it would be nice if we could offer a progress estimate while
      users search inside a document (example: display "x% completed out of yyyy words");

   > we may need to count the number of words for a company that translates documents, so that it
      knows how much to charge its customers;
   
   > a public institution might build a web page where documents can be uploaded; if there
      is a requirement that any uploaded document contains at most xxxx words,
      then we need code that reliably reads this information;

   .. and there are many other scenarios where counting words comes in handy. 


Let's assume that we have to count the number of words in an OpenXML Word document on a server.

If the 'word counter' tool runs on the server side, COM automation is not an option: automating Word on a server would place you in an unsupported scenario. For reference, this is the interop code we would otherwise write:

public void WordInteropDocStatistics()
{
 object missing = System.Reflection.Missing.Value;
 Microsoft.Office.Interop.Word.Application WdApp;
 Microsoft.Office.Interop.Word.Document    WdDoc;
 int CharactersIncludingSpaces;

 WdApp = new Microsoft.Office.Interop.Word.Application();
 WdDoc = WdApp.Documents.Open(PathName,
                              ref missing, ref missing, ref missing, ref missing, ref missing,
                              ref missing, ref missing, ref missing, ref missing, ref missing,
                              ref missing, ref missing, ref missing, ref missing, ref missing);

 CharactersIncludingSpaces = WdDoc.ComputeStatistics(
     Microsoft.Office.Interop.Word.WdStatistic.wdStatisticCharactersWithSpaces, ref missing);

 WdApp.Quit(false, ref missing, ref missing);

 System.Runtime.InteropServices.Marshal.ReleaseComObject(WdApp);
}

So what alternative do we have? Answer: use the OpenXML API.

A quick search on the internet reveals this code snippet:

public void WordOpenXMLDocStatistics()
{
 XmlDocument                xmlDocStructure;
 WordprocessingDocument     wordDocContainer;
 ExtendedFilePropertiesPart extPropPart;
 int CharactersIncludingSpaces;

 // open the package read-only and locate docProps/app.xml
 wordDocContainer = WordprocessingDocument.Open(PathName, false);
 extPropPart      = wordDocContainer.ExtendedFilePropertiesPart;

 xmlDocStructure  = new XmlDocument();
 xmlDocStructure.Load(extPropPart.GetStream());

 CharactersIncludingSpaces =
  int.Parse(xmlDocStructure.GetElementsByTagName(
             "CharactersWithSpaces").Item(0).InnerText);

 // release the package and its file handle
 wordDocContainer.Dispose();
}

But will it work as expected? Does it report the same number of characters as the one displayed inside the MS Word editor?
Answer: Yes, it works, but it's not reliable. Sometimes it gives the expected result, sometimes it returns a slightly different number.

Triggering the Word statistics mismatch problem

1.   Create a new Word document, type "=rand(1)" (without the quotes), then press the Enter key;
2.   Save the file as .docx, then close it;
3.   Open the file using an OpenXML editor, or rename the document from .docx to .zip, open the docProps
      folder and then open the app.xml file; note the values of these XML items:
      >  Pages
      >  Words
      >  Characters
      >  Lines
      >  Paragraphs
      >  CharactersWithSpaces

4.   Close the editor, or if you renamed the file to .zip, restore its original extension; open it again in Word;
5.   On the Review tab, in the Proofing group, click Word Count;
6.   Compare the statistics in the Word Count dialog with those noted from the app.xml file;

 Result: we easily notice that the numbers are different.

7.   Close the document again; you should be prompted to save it. Go ahead and save, to store the updated
      document information;
8.   Open its internal OpenXML structure again; this time you should see that the numbers match.
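
Steps 3 and 8 can also be done in code instead of by hand. The following is only a minimal sketch: the AppXmlStatistics class and its method names are made up for this illustration, and it assumes the statistics live in docProps/app.xml, which is where Word stores them in the files I inspected. It uses only the ZIP and LINQ to XML classes from the .NET base class library, so the OpenXML SDK is not required.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's sample project.
public static class AppXmlStatistics
{
    // Namespace of the extended-properties part (docProps/app.xml).
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    static readonly string[] Names =
        { "Pages", "Words", "Characters", "Lines", "Paragraphs", "CharactersWithSpaces" };

    // Parses the XML text of app.xml and returns the statistics that are present.
    public static Dictionary<string, int> Parse(string appXml)
    {
        var result = new Dictionary<string, int>();
        XElement root = XDocument.Parse(appXml).Root;
        foreach (string name in Names)
        {
            XElement element = root.Element(Ep + name);
            int value;
            if (element != null && int.TryParse(element.Value, out value))
                result[name] = value;
        }
        return result;
    }

    // Opens the .docx as a ZIP archive and reads docProps/app.xml from it.
    // Assumption: the part is stored under that exact name.
    public static Dictionary<string, int> Read(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("docProps/app.xml");
            if (entry == null)
                return new Dictionary<string, int>();
            using (var reader = new StreamReader(entry.Open()))
                return Parse(reader.ReadToEnd());
        }
    }
}
```

Calling AppXmlStatistics.Read on the file after step 2 and again after step 7 should show the stored values changing, which is the mismatch described above.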


The Word Count difference is by design

   >  MS Word has a set of tasks that are scheduled to run when the application is idle;
   >  the Compute Document Statistics task has a low priority;
   >  when the user changes content inside a document, Word just estimates the number of characters,
       paragraphs, etc. The data is stored in a special field inside the document and is marked as
       'estimated';
   >  when we save a file and the document statistics task has not yet been executed, we just get an
       estimated value in the CharactersWithSpaces element of the extended-properties part
       (docProps/app.xml);

 Our Product Group's developers were notified about the inconsistency between the OpenXML and COM
 object model results, and they replied that it was their decision to store approximate values in OpenXML.
  
The main reason why things happen in this way is performance. For a small file, the time needed to compute Word Count statistics is not a problem, but for a document with thousands of pages, it may take several seconds or even minutes. If Word were set to compute all the statistics in real time, before every save (or more often), end-users would surely complain about application hangs during saves.

So the Product Group decided to compromise between speed and accuracy. As you have noticed, the difference between the estimated values and the real values is no bigger than 1 ~ 2%, which is accurate enough for regular users. 

 

Can we obtain 100% reliable statistics?

Method #1:
       The OpenXML data can be forced to update: when Word receives a query about the statistics (through 
       the Ribbon graphical interface or the VBA object model), it computes the results and always reports
       the correct numbers.

       The only difficult part is finding a way to trigger the update every time, so that we are sure that the 
       information stored in the file is up to date.

       Advantages: 
              > easy to implement;
              > the OpenXML code is very simple;
              > works for all kinds of input files, even very complex ones (containing
                 embedded charts, shapes, nested tables, etc.);

       Disadvantages:
              > we can only force the Word Count update if we rely on a VBA / VSTO
                 add-in installed on the client side;
              > the automated Statistics Update add-in has to be deployed to all 
                 end-users; 

Method #2:
       Write our own OpenXML code and count the words ourselves.

       Advantages: 
              > no need for 'helper' tools;

       Disadvantages:
              > because the OpenXML format is VERY complex, the code will run reliably
                 only for basic input files; if you want to extend the program to 
                 handle all kinds of input documents, you will find that the complexity of the 
                 code increases up to the point where it is no longer feasible to continue with the 
                 project (you will very likely be forced to write individual rules for
                 all kinds of exceptions and special conditions for XML text tags, which may
                 appear in different combinations);

Hmmm ... something tells me that when you read the lines about the disadvantages of using Method #2, some wheels already started turning in your head. You must be thinking: "there must be some trick to get all the words ...".

OK, let me show you how you get started on this path ... if you decide that this approach is right for you, then you can further develop my sample code:

                      DISCLAIMER

Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment!

THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.

We grant You a nonexclusive, royalty-free right to use and modify the
Sample Code and to reproduce and distribute the object code form of
the Sample Code, provided that You agree:
   (i)   to not use Our name, logo, or trademarks to market Your
         software product in which the Sample Code is embedded;
   (ii)  to include a valid copyright notice on Your software
         product in which the Sample Code is embedded; and
   (iii) to indemnify, hold harmless, and defend Us and Our
         suppliers from and against any claims or lawsuits,
         including attorneys’ fees, that arise or result from the
         use or distribution of the Sample Code.

 The code sample below can be downloaded from this link.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.Xml;
using System.IO;

namespace ConsoleApp_WordCount_OpenXML
{
    class Program
    {
        static System.IO.StreamWriter fileOutput;

        static void Main(string[] args)
        {
            fileOutput = new System.IO.StreamWriter("C:\\ConsoleApp_WordCountOpenXML.log", true);

            //http://msdn.microsoft.com/en-us/library/cc974107.aspx
            //http://www.devx.com/dotnet/Article/42221/1954

            string document = @"C:\Test\WordCountOpenXML\Sample.docx";

            fileOutput.WriteLine("--------------------------------------------------------");
            fileOutput.WriteLine(DateTime.Now.ToString() + " Opened input file: " + document);

            //Namespace manager so we can identify the tags according to namespace
            const string wordmlNamespace =
                "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

            NameTable nt                  = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            //open the Word document (read-only, because we never modify it)
            WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, false);
            MainDocumentPart mainPart = wordDoc.MainDocumentPart;

            //identify the text nodes
            XmlDocument mainpartdoc = new XmlDocument(nt);
            mainpartdoc.Load(mainPart.GetStream());
            XmlNodeList textnodes = mainpartdoc.SelectNodes("//w:t", nsManager);

            //loop through all text nodes
            int counter = 0;
            int strLen = 0;

            foreach (XmlNode textnode in textnodes)
            {
                //InnerText also works for an empty <w:t/>, where FirstChild would be null
                string nodeText = textnode.InnerText;

                /*
                   strLen += nodeText.Length;
                      >> String.Length counts UTF-16 code units, so it would also
                         work for the sample document;
                      >> the sample instead copies the characters into a byte array
                         ('String.ToCharArray()', see GetBytesCount_Unicode) and counts
                         the non-zero bytes, so that the byte codes can be logged and
                         compared with the ASCII encoding;
                */
                strLen += GetBytesCount_Unicode(nodeText);

                if (nodeText.IndexOf(' ', 0) >= 0)
                {
                    string[] strWords = nodeText.Split(' ');
                    /* >> the 'String.Split(<blank>)' method is unreliable for counting:
                          it splits only on the regular space, so other separators
                          (non-breaking space, tab) are not recognized;
                       >> I used it here just to show the words in my program's
                          output Console;
                    */

                    /* if you put a break-point here and evaluate these 2 byte arrays,
                       you will notice that when this string '» ' is parsed, the first
                       byte array contains:

                          bWordsAsc[0] = 63 (?)
                          bWordsAsc[1] = 32 (<blank>)

                       ... which is obviously not correct, because the character '»' is
                       replaced by a '?' symbol, whereas the second array contains:

                          bWordsUnicode[0] = 187 ('»')
                          bWordsUnicode[1] = 0
                          bWordsUnicode[2] = 32 (<blank>)
                          bWordsUnicode[3] = 0
                    */
                    byte[] bWordsAsc     = Encoding.ASCII.GetBytes(nodeText);
                    byte[] bWordsUnicode = GetBytes(nodeText);

                    foreach (string strWd in strWords)
                    {
                        if (strWd.Length > 0)
                        {
                            Console.WriteLine("Word #" + Convert.ToString(counter++) +
                                              "\t\t Text: " + strWd);
                        }
                    }
                }
                else
                {
                    Console.WriteLine("Word #" + Convert.ToString(counter++) +
                                      "\t\t Text: " + nodeText);
                }
            }

            //release the package and its file handle
            wordDoc.Dispose();

            Console.WriteLine("------------------------------------------------------\r\nStatistics:\r\n");
            Console.WriteLine("Total characters : " + Convert.ToString(strLen));
            Console.WriteLine("CharactersWithSpaces / 6 : " + Convert.ToString(strLen / 6) + "\r\n");

            fileOutput.Close();

            Console.WriteLine("Press any key to continue..");
            Console.ReadKey();
        }
        //END OF Main

        //copies the UTF-16 code units of the string into a byte array (2 bytes per char)
        static byte[] GetBytes(string str)
        {
            byte[] bytes = new byte[str.Length * sizeof(char)];
            System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
            return bytes;
        }

        //logs every byte of the string and returns the number of non-zero bytes;
        //limitation: a character whose 2 bytes are both non-zero is counted twice
        static int GetBytesCount_Unicode(string str)
        {
            byte[] bytes = GetBytes(str);
            int counter = 0;

            fileOutput.WriteLine("------------------------------------------------------------------------");

            for (int i = 0; i < bytes.Length; i++)
            {
                //Console.WriteLine("\t\t > bytes[{0}] = {1}", Convert.ToString(i), (char)bytes[i]);

                fileOutput.WriteLine("\t\t > bytes[{0}] = {1} \t CharCode: {2}",
                                     Convert.ToString(i), (char)bytes[i], bytes[i]);

                //when counting the characters without spaces, enable the line from below
                //if (bytes[i] != 0 && bytes[i] != 32 && bytes[i] != 160)
                // 32 = space, 160 = non-breaking space:
                // http://en.wikipedia.org/wiki/Non-breaking_space

                if (bytes[i] != 0)
                {
                    counter++;
                }
            }

            fileOutput.WriteLine("------------------------------------------------------------------------");
            fileOutput.WriteLine("Subtotal characters count: {0}\r\n", counter);
            return counter;
        }
    }
}

How does the code work?

>  first of all, we start with a very basic sample file and try to get the code to run for that input;
>  for my tests, I used a plain text document containing 2 paragraphs (produced using the =rand(1) command); the 2nd paragraph was truncated and contains a special non-ASCII character: »  which is needed to demonstrate an issue that is very likely to be encountered when building a reliable program that counts characters or words;

>  we need to analyze the internal structure of an OpenXML file; there are several OpenXML viewers
    available, but I used the one that comes with the SDK: OpenXML Productivity Tool;

>  a WordprocessingML document contains a body element (named w:body) that contains all the paragraph (w:p) structures. Each paragraph contains one or more text runs (named w:r). Each text run contains one or more text nodes (named w:t).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
  xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
  xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
  xmlns:w10="urn:schemas-microsoft-com:office:word"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">

<w:body>
 <w:p w:rsidR="009C767E" w:rsidRDefault="00B653FD">
   <w:r>
     <w:t>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.</w:t>
   </w:r>
 </w:p>
 <w:p w:rsidR="001D483C" w:rsidRDefault="001D483C">

...

 </w:body>
</w:document>
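
To make the body / paragraph / run / text hierarchy more concrete, here is a small sketch that walks it with LINQ to XML. The TextRunWalker class is hypothetical (it is not part of my sample project), and it makes two simplifying assumptions: the caller already has the XML of the main document part as a string, and paragraphs are not nested inside other paragraphs (so no text boxes). Everything that is not a w:t node (tabs, breaks, field codes and so on) is ignored.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, shown only to illustrate the w:p / w:r / w:t structure.
public static class TextRunWalker
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph (w:p) by concatenating its w:t nodes.
    public static string[] ParagraphTexts(string documentXml)
    {
        // PreserveWhitespace keeps w:t nodes that contain nothing but spaces.
        XDocument doc = XDocument.Parse(documentXml, LoadOptions.PreserveWhitespace);
        return doc.Descendants(W + "p")
                  .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                  .ToArray();
    }

    // Sums the UTF-16 length of all paragraph texts.
    public static int CountCharacters(string documentXml)
    {
        return ParagraphTexts(documentXml).Sum(text => text.Length);
    }
}
```

Joining the runs of a paragraph before counting matters, because Word is free to split a single word across several w:r elements.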

 >  I am not going to explain the first part of the sample code, which handles the selection of each XML text node, because the article Manipulating Word 2007 Files with the Open XML Format API (Part 1 of 3) does a much better job;

 >  once we have a way to loop over each <w:t> structure ( foreach (XmlNode textnode in textnodes) ), we can quickly compute the number of characters; if we do a good job, we will get exactly the same value as the one stored in the CharactersWithSpaces element of the ExtendedFilePropertiesPart, after the document has been forced to update its statistics;

 >  there are many things which can make our count result in a different number of characters; a complex document contains many hidden formatting characters: shapes, tables, fields and formatting (bullets, TOC, citations); even simple HYPERLINK fields will make the issue noticeable; if we don't ensure that all those characters are detected and either summed or discarded, our count will be inaccurate; 

 >  my sample code is built to detect the easiest problem we might encounter while counting: the character encoding;
 >  this problem is not noticeable with simple input files; but when the code parses paragraph #2, it encounters a character which is not part of the ASCII character set;

 >  when counting characters, I tried to convert them into their byte code equivalent, then I used a variable which is incremented whenever a non-zero byte is encountered; 
 >  but I soon realized that for some characters my count was producing bad results; I traced the error back to the function which was translating characters to bytes: Encoding.ASCII.GetBytes;
 >  when I switched to using String.ToCharArray(), I got the expected results;

 >  you can also see this issue if you put a breakpoint on the foreach (string strWd in strWords) instruction and step through the code until it processes the 2nd paragraph; whenever Encoding.ASCII.GetBytes encounters a character which is not part of the ASCII character set, it encodes it as a "?";

 >  after switching to String.ToCharArray(), we get the UTF-16 byte codes, so we no longer have unknown character codes; but once we decide to use UTF-16, we have to take into account that every char occupies 2 bytes; so we must leave the 2nd byte out of our count if it is 0; note that this gives one count per character only when one of the two bytes is 0, which holds for » and the rest of the Latin-1 range, whereas a character whose two bytes are both non-zero would be counted twice;

 >  at the end, after counting all characters (including spaces), we can also estimate the number of words; 
 >  searching through the byte array, trying to detect words separated by one or more spaces, can be tricky; so I used a faster method (which can be unreliable): I divide the total number of characters by a ratio found by experimenting (in my example, I divide by 6);  

 

 

 >  please note that not all spaces have the same Unicode value; the regular space character has the code 32 (U+0020), but we also encounter other types, such as 160 (U+00A0) = non-breaking space (see the GetBytesCount_Unicode(string str) function: if you wish to count only non-blank characters, you have to enable this line: if (bytes[i] != 0 && bytes[i] != 32 && bytes[i] != 160) );
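
The counting ideas from the list above can be put side by side in a short sketch. TextStatistics is a hypothetical helper, not part of my sample project, and I have not verified which of these measures Word itself uses for CharactersWithSpaces or Words, so treat the choice as an assumption to test against your own documents. The first method reproduces the non-zero-byte technique, the others rely on what the .NET string API already offers.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper that compares several ways of counting text.
public static class TextStatistics
{
    // The sample's technique: count the non-zero bytes of the UTF-16 representation.
    public static int NonZeroUtf16Bytes(string text)
    {
        int count = 0;
        foreach (byte b in Encoding.Unicode.GetBytes(text)) // 2 bytes per char
            if (b != 0) count++;
        return count;
    }

    // Counts UTF-16 code units, spaces included.
    public static int CharactersWithSpaces(string text)
    {
        return text.Length;
    }

    // Skips every character that .NET classifies as white space, which covers
    // the regular space (32) and the non-breaking space (160).
    public static int CharactersWithoutSpaces(string text)
    {
        int count = 0;
        foreach (char c in text)
            if (!char.IsWhiteSpace(c)) count++;
        return count;
    }

    // Counts text elements, so a surrogate pair counts once.
    public static int TextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // Counts white-space-delimited tokens; a null separator list makes Split
    // treat every white-space character as a delimiter.
    public static int Words(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0;
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

For the string "» " both NonZeroUtf16Bytes and CharactersWithSpaces return 2, but for a character such as U+4E2D, whose two bytes are both non-zero, the byte technique returns 2 while the string length is 1. Words should be applied to the text of a whole paragraph rather than to individual w:t nodes, since a word can be split across runs.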

The second part of this article demonstrates how you can move the word count operation to the client side, just by writing a few lines of VBA code. In this way, the end-user updates the document just before finishing work on it, so that your software only has to read the CharactersWithSpaces extended property.  

 

 

Thank you for reading my article! Bye 🙂

 

 

 

 
