Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan


How to remove complex scripts from Word DOCX documents

I recently came across a Word document in which some parts seemed to ignore the general rules. The document was in English and its language was set to English (U.S.), but certain parts were set to Arabic (Saudi), and none of the usual methods of selecting the text and marking it as English (U.S.) helped. Very weird.

After a lot of fiddling around I also noticed that if I changed the style of a paragraph containing such text, the adjoining text changed but this particular text stayed as it was. I was able to change the font and size by applying them directly, but changes made via styles seemed to get ignored.

Then I realized that although this text was in English, because it was marked as Arabic (Saudi) it was being treated as “complex script” text in the style definitions and hence had separate rules. I guess that at some point someone had marked this text as Arabic (Saudi) and continued typing in English, or perhaps the original text was Arabic but someone had changed the font to a Latin one like “Times New Roman” and typed in English. So even though the text appeared as English, Word was in fact treating it as Arabic written in English (I guess). Anyway, the point was that Word was treating these blocks as complex script (as opposed to Latin for the other parts), so the usual formatting rules didn’t apply to them. Moreover, I could change the language from Arabic (Saudi) to Arabic (UAE), for instance, which seemed to support my theory: Word was letting me change the language to other complex-script languages, just not from complex to Latin or vice versa.
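As a rough illustration of what “separate rules” means here: WordprocessingML stores some run formatting twice, once for Latin text and once for complex-script text (for example `w:sz` versus `w:szCs` for size, and the `w:ascii` versus `w:cs` attributes of `w:rFonts` for the font). The sketch below parses a made-up style fragment and reports both sets. The fragment, the `SAMPLE` name and the `script_formatting` helper are hypothetical and not taken from the document in question.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical style fragment, invented purely for illustration.
SAMPLE = f"""
<w:style xmlns:w="{W}" w:type="paragraph" w:styleId="Normal">
  <w:rPr>
    <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Arial"/>
    <w:sz w:val="22"/>
    <w:szCs w:val="28"/>
  </w:rPr>
</w:style>
"""


def script_formatting(style_xml):
    """Return the Latin and complex-script font/size stored in one style."""
    rpr = ET.fromstring(style_xml).find(f"{{{W}}}rPr")
    fonts = rpr.find(f"{{{W}}}rFonts")

    def val(tag):
        # Sizes are stored in half-points in the w:val attribute.
        return rpr.find(f"{{{W}}}{tag}").get(f"{{{W}}}val")

    return {
        "latin": {"font": fonts.get(f"{{{W}}}ascii"), "half_points": val("sz")},
        "complex": {"font": fonts.get(f"{{{W}}}cs"), "half_points": val("szCs")},
    }


if __name__ == "__main__":
    print(script_formatting(SAMPLE))
```

In a style like this one, text flagged as complex script would pick up the `complex` values, which would be consistent with style changes appearing to skip the Arabic-tagged text.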

This being a DOCX file, it is really just a ZIP file, so I unzipped it using 7-Zip. I went to the word\styles.xml file in the extracted folder (which I found through trial and error, actually; I went through pretty much all the XML files there) and found the following:

Since I didn’t want the document to have any Arabic at all, I simply changed the “ar-SA” to “en-US”. I saved the XML file, went back to the extracted folder, and zipped all its contents up again. I renamed the result from .zip to .docx, opened the document, and bingo! All that complex-script weirdness was gone! :)
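The same manual edit can be sketched in Python without unzipping to disk. This is a minimal sketch, not the method used above (that was 7-Zip and a text editor). The function name `replace_language_tag` and its parameters are made up for illustration, and it assumes styles.xml is UTF-8 encoded so that a plain byte replacement is safe. Like the manual edit, it is a blunt search-and-replace of every “ar-SA” in that one file.

```python
import zipfile


def replace_language_tag(src_docx, dst_docx, old="ar-SA", new="en-US",
                         part="word/styles.xml"):
    """Copy src_docx to dst_docx, swapping `old` for `new` inside `part`.

    Returns the number of occurrences replaced. Assumes `part` is UTF-8.
    """
    old_b, new_b = old.encode("ascii"), new.encode("ascii")
    count = 0
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part:
                count = data.count(old_b)
                data = data.replace(old_b, new_b)
            # Entry names are copied unchanged, so they stay at the archive root.
            dst.writestr(info.filename, data)
    return count
```

A call would look like `replace_language_tag("broken.docx", "fixed.docx")`, where both file names are placeholders. Writing to a new file rather than over the original keeps a copy to fall back on if Word refuses to open the result.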

(A note about zipping the folder back up. The format must be ZIP. Also, don’t zip the top-level folder itself, as your zip file will then contain the top-level folder followed by all the sub-folders. What we want is for the zip file to contain the sub-folders and files directly.)
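The zipping caveat can be made concrete with a short sketch. Assuming the extracted files live in some folder, the hypothetical helper below (`zip_folder_contents` is an illustrative name) stores each file under its path relative to that folder, so the archive starts with entries like `word/styles.xml` rather than `extracted/word/styles.xml`.

```python
import os
import zipfile


def zip_folder_contents(folder, out_path):
    """Zip everything inside `folder`, without the top-level folder itself.

    `out_path` should live outside `folder`, or the archive would try to
    include itself.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                # Path relative to the folder, with forward slashes as ZIP expects.
                arcname = os.path.relpath(full, folder).replace(os.sep, "/")
                zf.write(full, arcname)
    return out_path
```

Passing an output name that already ends in .docx also skips the separate rename step.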