The `docx-xml` advanced skill: structural OOXML edits via run_code + lxml + zipfile, used only when the plain-text docx-editing tools can't express the change (deleting a paragraph mark, inserting multi-paragraph content, editing styles/headers/numbering, bulk rewrites). Establishes the canonical read-XML → mutate-with-lxml → repack-with-zipfile pattern (one document = one run_code call), warns off docx-js (write-only) and python-docx run-level mutation, then provides a deep OOXML reference: document structure, schema- compliance rules, smart-quote typography, tracked-change XML patterns, additive image insertion, common pitfalls, validation, plus Excel/business-day run_code recipes for non-DOCX work. When a structural Word edit is genuinely beyond edit_document and you need the raw OOXML manipulation patterns.
name: docx-xml emoji: "\U0001F527" description: "Advanced — structural OOXML edits via run_code + lxml + zipfile. Use ONLY when the docx-editing tools (edit_document, add_comment, accept/reject, revert_edit) can't express the change. For ordinary text edits, use the docx-editing skill instead." requires: tools: [read_document, run_code, list_documents]
DOCX XML — Advanced Skill
You are looking at this skill because a structural edit needs raw OOXML manipulation. Always check first whether edit_document can do it — see the docx-editing skill. Only drop to run_code for:
- Deleting a clause including its paragraph mark (needs
<w:p><w:pPr><w:rPr><w:del/></w:rPr></w:pPr>pattern) - Inserting multi-paragraph content mid-document
- Editing a part other than
word/document.xml(styles.xml, headers/footers, numbering.xml) - Bulk structural rewrites spanning dozens of paragraphs
Overview
A .docx file is a ZIP archive containing XML files. The main content is in word/document.xml. The OOXML namespace for Word is http://schemas.openxmlformats.org/wordprocessingml/2006/main (prefix w:).
Use language="python" + lxml + zipfile on the XML parts directly.
Why NOT docx-js: docx-js is write-only — there is no Document.load(...) API. It's for creating new DOCX files from scratch (the draft skill uses it). You cannot use it to load Contract.docx, mutate, and save back.
Why NOT python-docx for run-level mutation: python-docx silently corrupts complex legal templates. It loses non-target glyphs (other placeholders, special characters) when run.text is reassigned, has no reliable API for footnotes / bookmarks / internal hyperlinks / TableOfContents, and collapses <w:rPr> run properties when rewriting text — losing bold / font / size on surrounding runs. Never use python-docx for run.text.replace(...) or any run-level mutation. python-docx for reading and additive operations (adding images, adding paragraphs) is acceptable.
The right pattern — read XML, mutate with lxml, repack with zipfile
# run_code(language="python", input_files=["Contract.docx"], code=...)
import zipfile, io
from lxml import etree
NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{NS}}}"
# 1. Read document.xml from the input DOCX.
with zipfile.ZipFile('/sandbox/input/Contract.docx') as zin:
doc_xml = zin.read('word/document.xml')
parts = {name: zin.read(name) for name in zin.namelist()}
# 2. Mutate with lxml (example — delete a specific paragraph including its mark).
root = etree.fromstring(doc_xml)
for p in root.iter(f"{W}p"):
text = "".join(t.text or "" for t in p.iter(f"{W}t"))
if "text-to-delete" in text:
p.getparent().remove(p)
# 3. Repack, preserving every other part.
parts['word/document.xml'] = etree.tostring(root, xml_declaration=True,
encoding='UTF-8', standalone=True)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zout:
for name, data in parts.items():
zout.writestr(name, data)
with open('/sandbox/output/Contract_Edited.docx', 'wb') as f:
f.write(buf.getvalue())
One document = one run_code call. A single call should open the file, make all changes, and save one output. Never split a document edit across multiple scripts.
OOXML XML Reference
Document Structure
word/document.xml
<w:body>
<w:p> ← paragraph
<w:pPr> ← paragraph properties (style, numbering, spacing)
<w:pStyle w:val="Heading1"/>
<w:numPr>...</w:numPr>
<w:spacing>...</w:spacing>
<w:ind>...</w:ind>
<w:jc>...</w:jc>
<w:rPr>...</w:rPr> ← always last child of pPr
</w:pPr>
<w:r> ← run (contiguous text with same formatting)
<w:rPr> ← run properties (bold, italic, font, size)
<w:b/> ← bold
<w:i/> ← italic
<w:sz w:val="24"/> ← font size in half-points (24 = 12pt)
<w:rFonts w:ascii="Arial"/>
</w:rPr>
<w:t xml:space="preserve">text content</w:t>
</w:r>
</w:p>
Schema-compliance rules
| Rule | What goes wrong if ignored |
|---|---|
<w:t> with leading or trailing whitespace must carry xml:space="preserve" |
Word strips the spaces during round-trip |
Inside <w:del>, deleted text content lives in <w:delText>, not <w:t> |
Tracked changes corrupt; Accept All leaves stale text |
<w:rPr> is always the last child of <w:pPr> |
Schema violation; some viewers reject the document |
When wrapping a run in tracked-change markup, copy the original <w:rPr> into your <w:ins> / <w:del> runs |
The change preserves bold / font / size of surrounding text |
When creating a tracked change, replace the entire <w:r> element rather than injecting markup inside it |
Avoids invalid run nesting and preserves Word's display logic |
Every w:id in tracked changes is unique |
Word silently merges or drops changes with duplicate IDs |
xml:space="preserve" is the most-missed of these rules — most run-level edits need it |
Spaces disappearing is invisible in source diff but obvious to the user |
Smart quotes (typography)
DOCX documents use smart quotes stored as Unicode. When adding text via lxml, use these:
| Code | Char | Description |
|---|---|---|
‘ |
' | Left single quote |
’ |
' | Right single quote / apostrophe |
“ |
" | Left double quote |
” |
" | Right double quote |
In lxml: elem.text = 'Here’s a quote: “Hello”'
Tracked changes XML patterns
Insertion:
<w:ins w:id="1" w:author="Anylegal.ai" w:date="2026-05-04T00:00:00Z">
<w:r><w:rPr>...</w:rPr><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
<w:del w:id="2" w:author="Anylegal.ai" w:date="2026-05-04T00:00:00Z">
<w:r><w:rPr>...</w:rPr><w:delText>deleted text</w:delText></w:r>
</w:del>
Minimal edit — only mark what changes (governing law change from Delaware to England and Wales):
<w:r><w:t>This Agreement is governed by the laws of </w:t></w:r>
<w:del w:id="1" w:author="Anylegal.ai" w:date="...">
<w:r><w:delText>the State of Delaware</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Anylegal.ai" w:date="...">
<w:r><w:t>England and Wales</w:t></w:r>
</w:ins>
<w:r><w:t>.</w:t></w:r>
Deleting an entire paragraph — mark the paragraph mark as deleted too, otherwise "Accept All" leaves an empty line:
<w:p>
<w:pPr>
<w:rPr>
<w:del w:id="1" w:author="Anylegal.ai" w:date="..."/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Anylegal.ai" w:date="...">
<w:r><w:delText>The deleted clause text...</w:delText></w:r>
</w:del>
</w:p>
Rejecting a counterparty's insertion — nest your deletion inside their insertion (preserves their authorship in the audit trail):
<w:ins w:author="Counterparty" w:id="5">
<w:del w:author="Anylegal.ai" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>
Restoring a counterparty's deletion — add an insertion after their deletion (do not modify their deletion):
<w:del w:author="Counterparty" w:id="5">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Anylegal.ai" w:id="10">
<w:r><w:t>deleted text</w:t></w:r>
</w:ins>
Images (additive — python-docx is acceptable here)
Inserting images is an additive operation; python-docx handles relationships and content types correctly:
from docx import Document
from docx.shared import Inches, Cm
doc = Document('/sandbox/input/contract.docx')
doc.add_picture('/sandbox/input/logo.png', width=Inches(2.0))
# Or in a specific paragraph:
para = doc.paragraphs[0]
run = para.add_run()
run.add_picture('/sandbox/input/signature.png', width=Cm(5), height=Cm(2))
doc.save('/sandbox/output/contract_with_images.docx')
This is the only place python-docx is the right tool. For text mutation, always use lxml on the raw XML.
If you need raw XML control (custom positioning, wrapping):
<w:drawing>
<wp:inline distT="0" distB="0" distL="0" distR="0">
<wp:extent cx="914400" cy="914400"/> <!-- EMUs: 914400 = 1 inch -->
<wp:docPr id="1" name="Picture 1"/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:blipFill>
<a:blip r:embed="rId5"/>
</pic:blipFill>
<pic:spPr>
<a:xfrm><a:ext cx="914400" cy="914400"/></a:xfrm>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
To add via raw XML you must also (1) copy the image to word/media/, (2) add a relationship in word/_rels/document.xml.rels, and (3) register the content type in [Content_Types].xml.
Common pitfalls
- Modifying during iteration: when removing elements with lxml
findall(), iterate over a copy:for elem in list(body.findall(...)). - Namespace-qualified attribute access: use the full URI —
elem.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id'), notelem.get('w:id'). - Missing paragraph mark deletion: if you delete all text from a paragraph but don't mark the paragraph mark as deleted, "Accept All" leaves blank lines.
- Stale element references: after removing/moving elements, cached references may be invalid. Re-query after structural changes.
- ID uniqueness: every
w:idin tracked changes must be unique. Usemax(existing_ids) + 1.
Validation
When run_code produces a DOCX file, validate it before reporting success. The most common validation failures and their fixes:
| Failure | Fix |
|---|---|
<w:t> inside <w:del> |
Use <w:delText> for deleted text, not <w:t> |
Missing xml:space="preserve" |
Add the attribute to <w:t> elements that have leading/trailing whitespace |
| Corrupt ZIP | Ensure zipfile.ZipFile(..., 'w') is properly closed before reading the bytes back |
Schema violation in <w:pPr> |
Verify element order: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, <w:rPr> (rPr last) |
If the script produces a corrupt DOCX, fix the Python code and retry rather than retrying the same code.
Data + spreadsheet recipes (when run_code is the right tool for non-DOCX work)
Generate an Excel fee schedule
import openpyxl
from openpyxl.styles import Font, PatternFill
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Fee Schedule"
headers = ['Service', 'Rate (USD/hr)', 'Estimated Hours', 'Total']
for col, header in enumerate(headers, 1):
cell = ws.cell(row=1, column=col, value=header)
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = PatternFill(start_color="4472C4", fill_type="solid")
services = [
('Contract Review', 350, 8),
('Negotiation Support', 400, 12),
('Due Diligence', 375, 20),
]
for row, (service, rate, hours) in enumerate(services, 2):
ws.cell(row=row, column=1, value=service)
ws.cell(row=row, column=2, value=rate)
ws.cell(row=row, column=3, value=hours)
ws.cell(row=row, column=4, value=rate * hours)
ws.column_dimensions['A'].width = 25
wb.save('/sandbox/output/fee_schedule.xlsx')
Calculate business day deadlines
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
def add_business_days(start, days):
current = start
added = 0
while added < days:
current += timedelta(days=1)
if current.weekday() < 5:
added += 1
return current
signing_date = datetime(2026, 5, 4)
deadlines = {
"Signing Date": signing_date.strftime("%Y-%m-%d"),
"Effective Date (T+5 business days)": add_business_days(signing_date, 5).strftime("%Y-%m-%d"),
"First Payment Due (30 days)": (signing_date + timedelta(days=30)).strftime("%Y-%m-%d"),
"Warranty Period Ends (12 months)": (signing_date + relativedelta(months=12)).strftime("%Y-%m-%d"),
}
for name, date in deadlines.items():
print(f"{name}: {date}")