worlddevelopment / exercise_database

Content parts database for upload, automatic splitting, sharing, composition, and conversion of e.g. sheets of exercises or solutions in ODT, DOCX, TEX, HTML, ... Initiative of the University of Wuerzburg, Didactics of Mathematics.
http://didaktik.mathematik.uni-wuerzburg.de

Support indexless|hierarchical generic content part declaration patterns. #1

Open faerietree opened 7 years ago

faerietree commented 7 years ago

Definitions

Indexless := no index numbering scheme, i.e. if a number occurs, it is either content or denotes a hierarchy level in the markup, not a position in a series. => Numbers are explicit (no regular expression). => Only an implicit ordering is possible.

Indexed := with an index numbering scheme (i.e. explicit order).

Generic := filter by an expression (regex|wildcard|...)

Specific := explicit := filter by explicit content (a repeating phrase)

Raw content := markup content

Content := plain text content, i.e. the visual content such as informational text, media, ...

Content part declarations

Purpose

They are essential for the worlddevelopment civilization editor, open bookkeeper bot, ...

faerietree commented 7 years ago

This is relevant in this sense:

In some cases it makes sense to use a second pass for content phrase filtering instead of employing very complicated and error-prone regular expressions (e.g. for the Mixed case, where matching content and a specific markup node at the same time is very complicated and costly).
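
A minimal sketch of the two-pass idea, not the project's implementation. The markup, the tag name `p`, the phrase `Exercise`, and the helper names `first_pass` and `second_pass` are all assumptions made for illustration. The first pass matches only the markup node; the second pass filters the matched content by an explicit phrase, so no single regular expression has to cover both.

```python
import re

# Hypothetical raw content; tag names and phrases are illustrative only.
RAW = (
    "<h1>Algebra</h1>"
    "<p>Intro text</p>"
    "<p>Exercise 1: solve x + 1 = 2</p>"
    "<p>Exercise 2: solve 2x = 4</p>"
)


def first_pass(raw, tag):
    """Pass 1 (generic): match the markup node only, ignoring its content."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    return re.findall(pattern, raw, flags=re.S)


def second_pass(contents, phrase):
    """Pass 2 (specific): keep nodes whose content starts with the phrase."""
    return [c for c in contents if c.strip().startswith(phrase)]


paragraphs = first_pass(RAW, "p")
exercises = second_pass(paragraphs, "Exercise")
print(exercises)
```

Each pass stays simple enough to test on its own, at the price of iterating over the candidates twice.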

faerietree commented 7 years ago

Allows e.g. hierarchical progressive splitting of sections, i.e. gathering context (!) while on the way to detecting leaf content parts.

Thus in the end the leaf content parts found are the same as in the current weighted score system, where hit count (the number of content part declarations found) is the most significant factor. Because leaf content parts get the most hits in a tree structure such as an (XML-based) document, their declarations will always get the highest rating. This prevents hierarchical splitting, which is required to maintain context. (Providing context is exactly the purpose of section headers or, more generally, content part declarations. Repeating the context in every leaf content element is highly redundant.)
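
To illustrate why hit count alone favours leaf declarations, here is a small sketch. The sheet text, the two patterns, and the function `hit_counts` are hypothetical and do not reproduce the actual scoring code.

```python
import re

# Hypothetical sheet: one section header followed by three leaf exercises.
SHEET = """Section A
Exercise 1
Exercise 2
Exercise 3
"""

# Hypothetical candidate declaration patterns.
PATTERNS = {
    "section": r"^Section \w+",
    "exercise": r"^Exercise \d+",
}


def hit_counts(text, patterns):
    """Count how often each candidate declaration pattern occurs."""
    return {
        name: len(re.findall(rx, text, flags=re.M))
        for name, rx in patterns.items()
    }


counts = hit_counts(SHEET, PATTERNS)
# Rating by hit count alone: the leaf pattern wins.
winner = max(counts, key=counts.get)
print(counts, winner)
```

In any tree-shaped document the leaves outnumber their parents, so a rating dominated by hit count selects the leaf pattern and the section level is never used for splitting.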

As only markup stores the hierarchy level, there is no known way around extending the declaration detection separately for each sheet document type, e.g. ODT, DOCX, MD, RST, ...
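
As an illustration for one document type, the sketch below reads the hierarchy level from Markdown heading depth and carries the heading context down to each leaf. It assumes `#`-style headings only; the sample sheet and the function `leaf_parts` are hypothetical, and other formats (ODT, DOCX, RST) would need their own level detection.

```python
import re

# Hypothetical Markdown sheet; heading depth stands in for the hierarchy level.
SHEET = """# Algebra
## Exercise 1
Solve x + 1 = 2.
## Exercise 2
Solve 2x = 4.
# Geometry
## Exercise 1
Find the area of a unit square.
"""

HEADING = re.compile(r"^(#+) (.*)$")


def leaf_parts(text):
    """Walk the sheet top-down, attaching the heading context to each leaf."""
    context = []  # stack of (level, title) for the currently open headings
    parts = []    # list of (context titles, body line)
    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while context and context[-1][0] >= level:
                context.pop()
            context.append((level, m.group(2)))
        elif line.strip():
            parts.append(([title for _, title in context], line.strip()))
    return parts


parts = leaf_parts(SHEET)
for ctx, body in parts:
    print(" > ".join(ctx), "|", body)
```

The context is stored once per heading and attached to the leaves on the way down, so it does not have to be repeated inside every leaf content part.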

In these sets of declarations, the uppermost level must get the highest weight by all means, provided there is more than one occurrence, to ensure a top-down approach, which is mandatory here due to the tree structure. (Currently, as said above, only leaf content parts are detected in "worded" patterns.)
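
One possible reading of this rule, sketched under stated assumptions: candidates are ordered by hierarchy level first and hit count only second, and a candidate must occur more than once to be used for splitting. The candidate tuples, the threshold `min_hits`, and the function `pick_top_down` are hypothetical; treating "more than one occurrence" as a minimum-hit threshold is an interpretation, not something the issue confirms.

```python
# Hypothetical candidates: (name, hierarchy level with 1 = uppermost, hit count).
CANDIDATES = [
    ("exercise", 3, 14),
    ("title", 1, 1),
    ("chapter", 2, 2),
]


def pick_top_down(candidates, min_hits=2):
    """Order usable candidates from the uppermost level downwards.

    Level is the primary sort key, so it outweighs hit count; hit count
    only breaks ties within the same level.
    """
    usable = [c for c in candidates if c[2] >= min_hits]
    ranked = sorted(usable, key=lambda c: (c[1], -c[2]))
    return [name for name, _level, _hits in ranked]


order = pick_top_down(CANDIDATES)
print(order)
```

Here `title` is dropped because it occurs only once, and `chapter` is applied before `exercise` although it has far fewer hits.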