Open drfho opened 3 months ago
The notebook evaluates a function add_htmlblock_to_docx() for converting rich text to docx objects. Note: the current code handles only two levels of nested HTML children.
https://github.com/zms-publishing/ZMS/pull/288/commits/ec563e7514c2931503c5b1c57d87ae3ba168b082
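One way to lift the two-level limit could be to track the open inline tags on a stack instead of walking a fixed number of child levels. The following is a minimal sketch using only the standard library; it is not the notebook's add_htmlblock_to_docx(), and RunCollector, html_to_runs and the INLINE_FORMATS mapping are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Hypothetical mapping from inline HTML tags to run-level format flags.
INLINE_FORMATS = {"strong": "bold", "b": "bold", "em": "italic", "i": "italic"}


class RunCollector(HTMLParser):
    """Flatten arbitrarily nested inline HTML into (text, flags) tuples."""

    def __init__(self):
        super().__init__()
        self.active = []  # stack of format flags of the currently open tags
        self.runs = []    # collected (text, frozenset_of_flags) tuples

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_FORMATS:
            self.active.append(INLINE_FORMATS[tag])

    def handle_endtag(self, tag):
        flag = INLINE_FORMATS.get(tag)
        # Remove the innermost open occurrence of this flag, if any.
        for idx in range(len(self.active) - 1, -1, -1):
            if self.active[idx] == flag:
                del self.active[idx]
                break

    def handle_data(self, data):
        if data:
            self.runs.append((data, frozenset(self.active)))


def html_to_runs(html):
    """Return the text segments of html with the formats active for each."""
    collector = RunCollector()
    collector.feed(html)
    collector.close()  # flush text buffered after the last tag
    return collector.runs


for text, flags in html_to_runs("a <em>b <strong>c</strong></em>"):
    print(repr(text), sorted(flags))
```

Each tuple could then be handed to python-docx, for example via paragraph.add_run(text) followed by setting run.bold or run.italic, independent of how deeply the source tags were nested.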
Hints for https://github.com/zms-publishing/ZMS/pull/288/commits/b5c37c03c90bf81d27702df81f1afc4b582539ec
Certain layout/style properties cannot be set through the docx API; these need helper functions that use the XML API to add generic Office Open XML (WordprocessingML) objects:
@zmsdev for discussion: https://github.com/zms-publishing/ZMS/pull/288/commits/7b8517a2a156c981afccfe811d89210068231e2a
A class method standard_json (py) will generate a JSON representation of the object's content. The JSON should contain all structural/semantic information needed for processing by python-docx. I propose a list of dicts, each containing a list of dicts, each of which represents a fully described block that can easily be transformed into a docx object.
This JSON model is obviously too flat for dealing with inline formatting. The object's content blocks need to be segmented into w:p (paragraph) and w:r (run) elements. An inspiration for the structure of the JSON code might be:
http://pdfmake.org/playground.html
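To make the paragraph/run segmentation more concrete, here is a small sketch of what such a JSON block might look like. The keys "runs", "text" and "format", the value "paragraph" and the ids are assumptions for illustration only, not the structure used in the pull request.

```python
import json

# Hypothetical block model: one dict per block, inline formatting kept in runs.
blocks = [
    {
        "id": "e42",                 # hypothetical object id
        "meta_id": "ZMSTextarea",    # hypothetical meta id
        "docx_format": "paragraph",  # hypothetical block type
        "runs": [
            {"text": "Lorem ipsum ", "format": []},
            {"text": "dolor", "format": ["strong"]},
            {"text": " sit amet.", "format": ["em"]},
        ],
    }
]


def paragraph_text(block):
    """Join the run texts of one block, ignoring inline formats."""
    return "".join(run["text"] for run in block["runs"])


print(json.dumps(blocks, indent=2))
print(paragraph_text(blocks[0]))
```

Keeping the formats as a list per run would allow combinations such as strong plus em without adding nesting to the JSON.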
The JSONification (done by the "standard_json_docx" attribute) has changed to avoid code redundancy: for PAGE containers it is now done uniformly by the function manage_export_docx.get_docx_normalized_json(), whereas the PAGEELEMENTS still need a specific standard_json_docx method (py-primitive) for creating a content abstraction that can easily be transformed into the DOCX format. The JSON should mainly cover blocks (docx paragraphs/tables/images) and inline elements (e.g. em, strong).
For inline formats (DOCX: 'runs' with 'character formats') ZMS uses a minimal set of the corresponding HTML elements.
If the standard_json_docx attribute is missing, standard_html is used and a minimal HTML transformation into DOCX objects is performed:
Appending parsed_xml does not work properly, because doc.element.body.append(parsed_xml) appends at the end of the XML(!) document, which may not be the current position where the iterated element should be inserted. So these elements always end up after all other inserted objects when mixed with the API functions add_paragraph() or add_picture():
https://github.com/zms-publishing/ZMS/blob/45c18a8aebe3add86ddadd816fb44db29d1d6214/Products/zms/conf/metacmd_manager/manage_export_pydocx/manage_export_pydocx.py#L459-L461
So the XML must be fragmented into runs, which are then added iteratively to a new paragraph block: https://github.com/zms-publishing/ZMS/blob/c4237c1bcc9464a760e107babfdcc5c513fa6044/Products/zms/conf/metacmd_manager/manage_export_pydocx/manage_export_pydocx.py#L464-L473
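The fragmentation step can be illustrated without python-docx. The sketch below is not the linked code; it only shows, with the standard library, how a w:p string might be split into its w:r runs. split_paragraph_into_runs and the result keys "text" and "break" are hypothetical names.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def split_paragraph_into_runs(xml_text):
    """Return one dict per w:r element: its text and whether it has a w:br."""
    paragraph = ET.fromstring(xml_text)
    runs = []
    for run in paragraph.iter(f"{{{W_NS}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        has_break = run.find("w:br", NS) is not None
        runs.append({"text": text, "break": has_break})
    return runs


sample = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:r><w:t>Lorem ipsum dolor sit amet.</w:t><w:br/></w:r>"
    "<w:r><w:t>Second run.</w:t></w:r>"
    "</w:p>"
)

for run in split_paragraph_into_runs(sample):
    print(run)
```

With python-docx, each entry could then be added to a paragraph created by add_paragraph(), using paragraph.add_run(text) and, where "break" is set, run.add_break(), so that the content lands at the current position rather than at the end of the body.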
Then JSON blocks pre-rendered from the ZMS content model, containing XML-coded text blocks and images like:
[
  {
    "docx_format":"xml",
    "content":"
      <w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">
        <w:r>
          <w:t>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
          </w:t>
          <w:br/>
        </w:r>
      </w:p>",
    "id":"slides100",
    "meta_id":"bt_slide",
    "parent_id":"e99",
    "parent_meta_id":"bt_carousel"
  },
  {
    "docx_format":"image",
    "content":"http://127.0.0.1:8092/myzms2/content/e12/e98/e99/slides100/electricity.jpg",
    "id":"slides100_1",
    "meta_id":"bt_slide",
    "parent_id":"e99",
    "parent_meta_id":"bt_carousel"
  }
]
will be sequenced correctly in the Word file.
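The sequencing of such blocks can be pictured as a simple dispatch on docx_format that keeps the input order. This is only a sketch: plan_docx_operations and the operation label "append_runs_from_xml" are hypothetical, and only the values "xml" and "image" are taken from the example above.

```python
def plan_docx_operations(blocks):
    """Map each JSON block to an (operation, payload) pair in input order."""
    handlers = {
        # XML-coded text: split into runs and add them to a new paragraph
        "xml": lambda block: ("append_runs_from_xml", block["content"]),
        # Image URL: add at the current position via add_picture()
        "image": lambda block: ("add_picture", block["content"]),
    }
    plan = []
    for block in blocks:
        handler = handlers.get(block.get("docx_format"))
        if handler is None:
            # Unknown formats are recorded rather than silently dropped
            plan.append(("skip", block.get("id")))
        else:
            plan.append(handler(block))
    return plan


example_blocks = [
    {"docx_format": "xml", "content": "<w:p/>", "id": "slides100"},
    {"docx_format": "image", "content": "electricity.jpg", "id": "slides100_1"},
    {"docx_format": "table", "content": "", "id": "e7"},
]

for operation in plan_docx_operations(example_blocks):
    print(operation)
```

Because every block is handled by an operation that inserts at the current position, the order in the Word file follows the order of the JSON list.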
FOR DISCUSSION: @zmsdev, @jklein-dev
Authors often do not organize headline levels in a correct hierarchy; in practice, for example, h3 instead of h2 may follow h1.
When exporting the content to DOCX, the following algorithm may fix a wrong headline hierarchy/sequence. I added a Jupyter notebook for evaluation:
https://github.com/zms-publishing/ZMS/blob/2da75bca40cce528106831024b87100fcf674ac7/docs/notebooks/snippets_10_pythondocx.ipynb
Although (or because? ;-) the Python code looks a bit complicated/long, it seems to work well. The basic ideas are:
Any ideas for shortening/improving the code?
# Example lists of headline levels:
example_lists = [
    [1, 3, 3, 4, 4, 1, 3, 4, 3, 1, 2],
    [1, 1, 3, 2, 4, 1, 2, 4, 3, 5, 2, 2],
    [2, 1, 3, 3, 5, 3],
    [2, 1, 5],
    [3, 2],
]


def normalize_headline_levels(list1):
    """Return a copy of list1 in which no headline skips a level."""
    if not list1:
        return []
    list2 = list1.copy()  # list1 keeps the original levels for comparison
    l = len(list2)
    i = 0
    n = 0  # last index handled by the inner loop (0 = none)
    # Start with headline level 1
    list2[0] = 1
    while i < l:
        # Continue behind a run of equal levels handled in the previous pass
        i = i + 1 if (n == 0 or i > n) else n + 1
        n = 0
        if i >= l:
            break
        v = list2[i]
        if v == list1[i-1]:
            continue
        if v - list1[i-1] > 1 or v - list2[i-1] > 1:
            list2[i] = list1[i-1] + 1
            if v - list2[i-1] > 1:
                list2[i] = list2[i-1] + 1
            n = i
            # Following headlines with the same original level get the same
            # new level (the bounds check also covers a run at the list end)
            while n + 1 < l and list1[n+1] == list1[n]:
                n += 1
                list2[n] = list2[i]
    return list2


for levels in example_lists:
    print('%s ==>\n%s\n' % (levels, normalize_headline_levels(levels)))
Output:
[1, 3, 3, 4, 4, 1, 3, 4, 3, 1, 2] ==>
[1, 2, 2, 3, 3, 1, 2, 3, 3, 1, 2]
[1, 1, 3, 2, 4, 1, 2, 4, 3, 5, 2, 2] ==>
[1, 1, 2, 2, 3, 1, 2, 3, 3, 4, 2, 2]
[2, 1, 3, 3, 5, 3] ==>
[1, 1, 2, 2, 3, 3]
[2, 1, 5] ==>
[1, 1, 2]
[3, 2] ==>
[1, 2]
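Regarding a shorter variant: one possible alternative, offered as a sketch and not as a drop-in replacement, is to keep a stack of the original levels that are currently "open" and to use the stack depth as the new level. normalize_levels_with_stack is a hypothetical name. Note that this is a different technique and does not always reproduce the results of normalize_headline_levels(): when a headline returns to a shallower original level, it is placed next to its earlier sibling, so [2, 1, 3, 3, 5, 3] yields [1, 1, 2, 2, 3, 2] instead of [1, 1, 2, 2, 3, 3].

```python
def normalize_levels_with_stack(levels):
    """Normalize headline levels using a stack of open original levels."""
    if not levels:
        return []
    # As in the function above, the first headline is treated as level 1.
    levels = [1] + list(levels[1:])
    open_levels = []  # original levels of the currently open ancestors
    result = []
    for level in levels:
        # Close every open headline that is not an ancestor of this one.
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
        open_levels.append(level)
        # The nesting depth is the normalized level.
        result.append(len(open_levels))
    return result


for example in ([1, 3, 3, 4, 4, 1, 3, 4, 3, 1, 2], [2, 1, 3, 3, 5, 3], [3, 2]):
    print('%s ==>\n%s\n' % (example, normalize_levels_with_stack(example)))
```

Since the new level is the stack depth, a headline can never be more than one level deeper than its predecessor, which is the property the DOCX export needs.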
Ref: https://github.com/zms-publishing/ZMS/issues/287