zms-publishing / ZMS

Running on Python 3.8+

Word/DOCX-Export: notebook prototype #288

Open drfho opened 3 months ago

drfho commented 3 months ago

Ref: https://github.com/zms-publishing/ZMS/issues/287

drfho commented 3 months ago

The notebook evaluates a function add_htmlblock_to_docx() for converting rich text to docx objects. Hint: the current code handles only two levels of HTML children.

https://github.com/zms-publishing/ZMS/pull/288/commits/ec563e7514c2931503c5b1c57d87ae3ba168b082
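
One way the two-level limit could be lifted is to track the open inline tags on a stack, so that nesting depth no longer matters. The following is only a sketch using the standard library; it is not the code of add_htmlblock_to_docx(), and the names RunCollector, html_to_runs and INLINE_FLAGS are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical mapping from inline HTML tags to character-format flags.
INLINE_FLAGS = {"strong": "bold", "b": "bold", "em": "italic", "i": "italic"}


class RunCollector(HTMLParser):
    """Flatten arbitrarily nested inline HTML into (text, flags) runs."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open inline tags
        self.runs = []       # list of (text, frozenset of flags)

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_FLAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost matching tag, tolerating sloppy markup.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        if data:
            flags = frozenset(INLINE_FLAGS[t] for t in self.open_tags)
            self.runs.append((data, flags))


def html_to_runs(html):
    """Return the runs of an HTML fragment in document order."""
    parser = RunCollector()
    parser.feed(html)
    parser.close()  # flush text that is still buffered
    return parser.runs
```

With python-docx, each resulting run could then be passed to paragraph.add_run(text), setting run.bold and run.italic from the flags.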

drfho commented 3 months ago

Hints for https://github.com/zms-publishing/ZMS/pull/288/commits/b5c37c03c90bf81d27702df81f1afc4b582539ec

Certain layout/style properties cannot be set through the python-docx API. Here some helper functions are needed that use the XML API and add generic Office Open XML (WordprocessingML) elements:

  1. document field PAGE number: the example code adds a page counter to the footer
  2. paragraph property borders: the example code adds a bottom border to the added style "Description"
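
To illustrate what such helpers have to produce, here is a minimal sketch of the two WordprocessingML fragments, built with the standard library only. It is not the code from the linked commit; the function names page_field_runs and bottom_border_ppr and the border size are assumptions.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return the Clark-notation name for a WordprocessingML tag."""
    return "{%s}%s" % (W_NS, tag)


def page_field_runs():
    """Build the three runs of a PAGE field: begin, instruction, end."""
    runs = []
    for kind in ("begin", "instr", "end"):
        run = ET.Element(w("r"))
        if kind == "instr":
            instr = ET.SubElement(run, w("instrText"))
            instr.set("{%s}space" % XML_NS, "preserve")
            instr.text = " PAGE "
        else:
            ET.SubElement(run, w("fldChar"), {w("fldCharType"): kind})
        runs.append(run)
    return runs


def bottom_border_ppr(size_eighths_pt=6):
    """Build a w:pPr with a single-line bottom border (w:sz is in 1/8 pt)."""
    ppr = ET.Element(w("pPr"))
    pbdr = ET.SubElement(ppr, w("pBdr"))
    ET.SubElement(pbdr, w("bottom"), {
        w("val"): "single",
        w("sz"): str(size_eighths_pt),
        w("space"): "1",
        w("color"): "auto",
    })
    return ppr


print(ET.tostring(bottom_border_ppr(), encoding="unicode"))
```

python-docx itself is built on lxml, so in practice the same elements would more likely be created with docx.oxml.OxmlElement and docx.oxml.ns.qn and appended to the underlying element of a footer paragraph or of a style.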

drfho commented 3 months ago

@zmsdev for discussion: https://github.com/zms-publishing/ZMS/pull/288/commits/7b8517a2a156c981afccfe811d89210068231e2a

A class method standard_json (py) will generate a JSON representation of the object's content. The JSON should contain all structural/semantic information needed for processing by python-docx. I propose a list of dicts, each containing a list of dicts, each representing a fully described block that can easily be transformed into a docx object.

TASK: JSONify mix of paragraphs (block) and runs (inline)

This JSON model is obviously too flat for dealing with inline formatting. The object's content blocks need to be segmented into paragraph (w:p) and run (w:r) elements. An inspiration for the structure of the JSON code might be: http://pdfmake.org/playground.html
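
A possible shape for such a block/run model is sketched below. Every key name ("type", "style", "runs", "bold", "level") is purely illustrative and not taken from the ZMS code; the point is only that each block carries its own list of inline runs.

```python
import json

# Hypothetical block/run model: all key names are illustrative only.
blocks = [
    {
        "type": "paragraph",
        "style": "Normal",
        "runs": [
            {"text": "Lorem ipsum "},
            {"text": "dolor", "bold": True},
            {"text": " sit amet."},
        ],
    },
    {"type": "heading", "level": 2, "runs": [{"text": "Section"}]},
]


def block_text(block):
    """Concatenate the run texts of one block, ignoring formatting."""
    return "".join(run["text"] for run in block.get("runs", []))


print(json.dumps(blocks, indent=2))
```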

drfho commented 3 months ago

The JSONification (done by the "standard_json_docx" attribute) has changed to avoid code redundancy: for PAGE containers it is now done uniformly by a function manage_export_docx.get_docx_normalized_json(), whereas the PAGEELEMENTS still need a specific standard_json_docx method (py-primitive) for creating a content abstraction that can easily be transformed into the DOCX format. The JSON should mainly cover blocks (docx paragraphs/tables/images) and inline elements (e.g. em, strong). For inline formats (DOCX: 'runs' with 'character formats') ZMS uses a minimal set of the corresponding HTML elements.

If the standard_json_docx attribute is missing, standard_html is used and a minimal HTML transformation into DOCX objects is performed.
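
The fallback rule can be sketched as follows. This is not the ZMS implementation: the node is modelled as a plain dict, and both the function name get_docx_blocks and the "html" value for docx_format are assumptions.

```python
def get_docx_blocks(node):
    """Prefer the node's JSON abstraction; otherwise fall back to its HTML.

    `node` is a plain dict standing in for a ZMS content object (assumption).
    """
    blocks = node.get("standard_json_docx")
    if blocks:
        return blocks
    # Assumed marker telling the exporter to run the minimal HTML transformation.
    return [{"docx_format": "html", "content": node.get("standard_html", "")}]
```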

drfho commented 3 months ago

Appending parsed_xml does not work properly, because doc.element.body.append(parsed_xml) appends at the end of the XML(!) document, and this might not be the current position where the iterated element should be appended. So these elements will always occur after all other inserted objects when mixed with the API functions add_paragraph() or add_picture(): https://github.com/zms-publishing/ZMS/blob/45c18a8aebe3add86ddadd816fb44db29d1d6214/Products/zms/conf/metacmd_manager/manage_export_pydocx/manage_export_pydocx.py#L459-L461

So the XML must be fragmented into runs and added iteratively to a new paragraph block: https://github.com/zms-publishing/ZMS/blob/c4237c1bcc9464a760e107babfdcc5c513fa6044/Products/zms/conf/metacmd_manager/manage_export_pydocx/manage_export_pydocx.py#L464-L473
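
The fragmentation step might look roughly like the sketch below, which uses only the standard library and is not the linked implementation. The function name split_paragraph_xml is hypothetical, and run properties (w:rPr) are deliberately ignored here.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def split_paragraph_xml(xml):
    """Return [(text, has_break), ...] for every w:r in a w:p fragment."""
    paragraph = ET.fromstring(xml)
    runs = []
    for run in paragraph.iter("{%s}r" % W_NS):
        # A run may hold several w:t elements; join their texts.
        text = "".join(t.text or "" for t in run.iter("{%s}t" % W_NS))
        has_break = run.find("{%s}br" % W_NS) is not None
        runs.append((text, has_break))
    return runs
```

With python-docx, each tuple could then be replayed at the current position via p = doc.add_paragraph(), r = p.add_run(text) and, where has_break is set, r.add_break().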

Then JSON blocks pre-rendered from the ZMS content model, with XML-coded text blocks and images, like:

[
    {
        "docx_format":"xml",
        "content":"<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:r><w:t>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</w:t><w:br/></w:r></w:p>",
        "id":"slides100",
        "meta_id":"bt_slide",
        "parent_id":"e99",
        "parent_meta_id":"bt_carousel"
    },
    {
        "docx_format":"image",
        "content":"http://127.0.0.1:8092/myzms2/content/e12/e98/e99/slides100/electricity.jpg",
        "id":"slides100_1",
        "meta_id":"bt_slide",
        "parent_id":"e99",
        "parent_meta_id":"bt_carousel"
    }
]

will be sequenced correctly in the Word file.
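
The sequencing relies on handling the blocks strictly in list order and letting each handler add its content at the current document position. A minimal dispatch sketch, with render_blocks and the handlers mapping as hypothetical names:

```python
def render_blocks(blocks, handlers):
    """Call the handler matching each block's docx_format, in list order."""
    for block in blocks:
        handler = handlers.get(block["docx_format"])
        if handler is None:
            raise ValueError("unsupported docx_format: %r" % block["docx_format"])
        handler(block)


# Usage with dummy handlers that only record what they would insert.
log = []
render_blocks(
    [
        {"docx_format": "xml", "content": "<w:p/>", "id": "slides100"},
        {"docx_format": "image", "content": "electricity.jpg", "id": "slides100_1"},
    ],
    {
        "xml": lambda b: log.append(("xml", b["id"])),
        "image": lambda b: log.append(("image", b["id"])),
    },
)
print(log)
```

In a real exporter the "image" handler would presumably fetch the URL and pass the data to doc.add_picture(), which accepts a path or a file-like object.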

drfho commented 2 months ago

FOR DISCUSSION: @zmsdev, @jklein-dev Authors often do not organize headline levels in a correct hierarchy, e.g. in practice h3 instead of h2 may follow h1. When exporting the content to DOCX, the following algorithm may fix a wrong headline hierarchy/sequence. I added a Jupyter notebook for evaluation: https://github.com/zms-publishing/ZMS/blob/2da75bca40cce528106831024b87100fcf674ac7/docs/notebooks/snippets_10_pythondocx.ipynb

Although (or because ? ;-) the py-code looks a little bit complicated/long, it seems to work well. The basic ideas are:

  1. it starts with h1 (level = 1)
  2. leveling down must be +1
  3. on any down-jump >+1 same levels are lifted up in a sequence

Any ideas for shortening/improving the code?

# Example lists of headline levels:
example_lists = [
    [1,3,3,4,4,1,3,4,3,1,2],
    [1,1,3,2,4,1,2,4,3,5,2,2],
    [2,1,3,3,5,3],
    [2,1,5],
    [3,2]
]

def normalize_headline_levels(list1):
    list2 = list1.copy()  # Create a copy of list1
    l = len(list2)
    i = 0
    n = 0
    # Start with headline level 1
    list2[0] = 1
    while i < l:
        i = i + 1 if (n == 0 or i > n) else n + 1
        n = 0
        if i >= l:
            break
        v = list2[i]
        if v == list1[i-1]:
            continue
        if v - list1[i-1] > 1 or v - list2[i-1] > 1:
            list2[i] = list1[i-1] + 1
            if v - list2[i-1] > 1:
                list2[i] = list2[i-1] + 1
            n = i
        if n + 1 >= l:
            break
        while list1[n+1] == list1[n]:
            n += 1
            if n + 1 >= l:
                break
            list2[n] = list2[i]
    return list2

for levels in example_lists:
    print('%s ==>\n%s\n' % (levels, normalize_headline_levels(levels)))

Output:

[1, 3, 3, 4, 4, 1, 3, 4, 3, 1, 2] ==>
[1, 2, 2, 3, 3, 1, 2, 3, 3, 1, 2]

[1, 1, 3, 2, 4, 1, 2, 4, 3, 5, 2, 2] ==>
[1, 2, 2, 2, 3, 1, 2, 3, 3, 4, 2, 2]

[2, 1, 3, 3, 5, 3] ==>
[1, 1, 2, 2, 3, 3]

[2, 1, 5] ==>
[1, 1, 2]

[3, 2] ==>
[1, 2]
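
Regarding the question about shortening: one candidate is a stack-based renumbering, sketched below. It is a different technique, not the notebook's algorithm, and it does not reproduce the output above in every case. The function name normalize_levels_stack is hypothetical.

```python
def normalize_levels_stack(levels):
    """Renumber heading levels so that each step down is exactly +1.

    Alternative sketch: a stack remembers the original levels of the
    headings that are currently open ancestors.
    """
    stack = []   # original levels of the open ancestor headings
    result = []
    for level in levels:
        # Close every open heading that is not a true ancestor.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        result.append(len(stack))
    return result


print(normalize_levels_stack([1, 3, 3, 4, 4, 1, 3, 4, 3, 1, 2]))
# prints [1, 2, 2, 3, 3, 1, 2, 3, 2, 1, 2]
```

Compared with the output listed above, this variant maps a headline that returns to an earlier original level back to that level's normalized value (the ninth entry becomes 2 instead of 3), keeps a repeated h1 at level 1, and turns [3, 2] into [1, 1] instead of [1, 2]. Whether that behaviour is wanted for the export is open for discussion.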