wmo-im / BUFR4

BUFR edition 4

non-CCITTIA5 characters in GitHub .csv file(?) #55

Closed jbathegit closed 3 years ago

jbathegit commented 3 years ago

Branch

https://github.com/wmo-im/BUFR4/tree/issue55
https://github.com/wmo-im/CCT/commits/bufrIssue55

Final Proposal

add when ready

Original Discussion Point

Hello @wmo-im/tt-tdcf

Maybe I'm mistaken, but it was my understanding that everything in the csv files should be part of the CCITT IA5 character set(?)
I'm asking because a colleague of mine found a couple of entries in https://raw.githubusercontent.com/wmo-im/BUFR4/master/BUFRCREX_CodeFlag_en_02.csv which aren't part of this character set; specifically, there are ‰ (parts-per-thousand) symbols in the meaning descriptions for code figures 1 and 2 of the code table for 0-02-033.

Is this permissible? One of the problems with having these sorts of things in csv files is that they require multiple bytes to represent them (for example, ‰ is hexadecimal e280b0 in UTF-8), which in turn complicates automated string searches and other processing from a standard keyboard. I realize this is a valid symbol in full UTF-8, but for some reason I thought we were sticking just to CCITT IA5 for the csv files that we distribute, so that every character can be represented within a single byte.
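
As a minimal illustration of the multi-byte point (just a sketch in Python 3, not part of the tables themselves):

# "‰" (PER MILLE SIGN) needs three bytes in UTF-8, whereas every
# CCITT IA5 character fits in a single byte.
per_mille = "\u2030"
print(per_mille.encode("utf-8").hex())  # e280b0
print(len(per_mille))                   # 1 character ...
print(len(per_mille.encode("utf-8")))   # ... but 3 bytes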

jbathegit commented 3 years ago

Other examples of this are in https://github.com/wmo-im/CCT/blob/master/C11.csv, where there are a lot of place names containing UTF-8, e.g. "La Réunion", "Norrköping", etc.

Please understand that my question only pertains to the .csv files which many users download and ingest for automated processing. And I'm not saying that UTF-8 is a bad thing; rather, it just complicates automated processing. So if it's possible for these files to contain characters that can't be represented in a single byte, then we probably should make that clear to users so their parsers can take that into account. At least in prior iterations, the xml files contained an attribute explicitly noting that the character set was utf-8, so if we're also going to allow such characters in the csv files, then maybe we just need a similar explicit note, though I'm not sure how to do that within a csv file(?)

On a similar note, I seem to recall that our practice going forward is going to be that we, as team members, are going to be individually responsible for manually editing the .csv files in a new branch for any changes that we propose, so that those changes can be tested for validation and potential later merging into the main branch. So, again, do we want to allow the full UTF-8 character set in these files, given that most such characters can't be typed from a standard keyboard?

efucile commented 3 years ago

Dear @jbathegit, thanks for raising this. I agree that we need to be clear on the character set used for our tables. I made a quick check and I can confirm that, going back to 2016, the encoding is utf-8. I understand that the problem here is that in xml it was explicitly stated encoding=utf-8, while in the csv files it needs to be detected by the user's software. You can check from the command line in Linux environments: file -I will give you encoding=utf-8, because there is a header in the txt file declaring the encoding. However, from our programs we need to know the encoding in order to use the appropriate calls to deal with utf-8. Therefore I agree that we need to state it clearly; I am just not sure where.
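
To illustrate the point about programs needing to know the encoding, a consuming program could check the file itself before parsing (a rough Python 3 sketch; the filename is just an example):

def detect_encoding(path: str) -> str:
    """Return 'ascii' if the file is pure CCITT IA5/ASCII, otherwise 'utf-8'."""
    raw = open(path, "rb").read()
    try:
        raw.decode("ascii")   # CCITT IA5 is a 7-bit, single-byte set
        return "ascii"
    except UnicodeDecodeError:
        raw.decode("utf-8")   # raises if the file is neither ASCII nor UTF-8
        return "utf-8"

print(detect_encoding("BUFRCREX_CodeFlag_en_02.csv"))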

@amilan17 it is also important to provide guidance to the TT members on how they can change the files in the repository, if we want them to do that. If you pull a file onto your computer, change it using Excel and push it back, in most cases it will be broken, because Excel does not do a good job with utf-8 csv files. I had this problem with the wigos metadata. The correct way is to use an editor that preserves the encoding, or to use GitHub web editing. However, if someone accidentally changes the file encoding and pushes it to git, a large number of hard-to-understand changes will appear, which should trigger an inspection and possibly some fixes.

SimonElliottEUM commented 3 years ago

@jbathegit @efucile we need to recall that the units used in Table B need to be describable in CCITT-IA5, otherwise we could not use 0-00-015 to describe them. Being able to exchange the Tables themselves and their components in BUFR is a vital feature, so we must make sure that either the units are expressed using the CCITT-IA5 character set, or we introduce new descriptors in Table 0 for this purpose, with units of UTF-8.

efucile commented 3 years ago

@SimonElliottEUM you are right. Is it only for the units? I am afraid that this is the case for all the tables. I think we need to decide a way forward here. We may need to change the format of the csv and xml tables to CCITT-IA5. A question for @wmo-im/tt-tdcf: I always forget about this feature of encoding the tables in BUFR because we stopped using it at some point, but I imagine that it is used by many centres. If this is the case, we should distribute the tables in BUFR's own machine-readable format, which is BUFR itself. Do we need to have the tables in BUFR format distributed alongside csv and xml? This would ensure that we have the required encoding. My point here is that it is difficult to guarantee that we satisfy a requirement without using it.

jbathegit commented 3 years ago

Maybe I'm mistaken, but I think in most cases we're OK with the units. For example, in https://raw.githubusercontent.com/wmo-im/BUFR4/master/BUFRCREX_TableB_en_11.csv, the wind speeds are listed as "m/s" (which are all CCITTIA5 characters), rather than "m s⁻¹". Or in https://raw.githubusercontent.com/wmo-im/BUFR4/master/BUFRCREX_TableB_en_13.csv, accumulations and rates are listed as "kg m-2" or "kg m-2 s-1" respectively, rather than "kg m⁻²" or "kg m⁻² s⁻¹". So in most cases I think we're OK in terms of units for the csv and xml files. However, that said, I did spot a units indicator of "‰" for 0-15-028 within https://github.com/wmo-im/BUFR4/blob/master/BUFRCREX_TableB_en_15.csv, so there are occurrences of non-CCITTIA5 content in units fields, and therefore @SimonElliottEUM's point is a good one.

But in general, and going back to my original query, I still think our bigger problem is with meaning strings. For example, and continuing along the line of @SimonElliottEUM's argument, his point about Table 0 becomes even more relevant in the context of 0-00-025 and 0-00-027, which also explicitly state that the contents should be CCITTIA5.

jbathegit commented 3 years ago

Per @DavidBerryNOC's helpful suggestion during today's TT-TDCF telecon, I agree that we can use entries in Common Code Table C-6 (from the "Abbreviation in IA5/ASCII" column) to resolve this for anything involving units. For example, as noted in entry 301 of Common Code Table C-6, we can simply replace "‰" with "0/00" everywhere it occurs in any units field (e.g. for 0-15-028) or in any meaning description (e.g. in code figures 1 and 2 of the code table for 0-02-033), and then all such occurrences can now be encoded using CCITT-IA5.
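
A minimal sketch of that substitution in Python 3, assuming local copies of the CSV files (the glob pattern is illustrative, not an agreed workflow):

import glob

for path in glob.glob("BUFRCREX_*.csv"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Replace every "‰" with "0/00", per entry 301 of Common Code Table C-6.
    fixed = text.replace("\u2030", "0/00")
    if fixed != text:
        with open(path, "w", encoding="utf-8") as f:
            f.write(fixed)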

However, we would still need to figure out a solution for other items such as place names in Common Code Table C-11 (e.g. "La Réunion", "Norrköping") since these special characters aren't part of the CCITT-IA5 character set either, and they also aren't units so Common Code Table C-6 doesn't provide us with a solution for these cases.

amilan17 commented 3 years ago

next steps: identify all non-CCITT-IA5 characters in Table B (and possibly Table D) CSVs and replace

SimonElliottEUM commented 3 years ago

I have some general observations:

SimonElliottEUM commented 3 years ago

Here are the entries we need to address in the common code tables:
C-1 / C-11: 25, 46, 74, 82, and 167
C-2: 73/173
C-5: 530 (there is a tab in the XML file)
C-12: 46/10, 46/11, 46/12 ... 46/25, 147/10, 74/29, 74/30, 74/35, 74/36, 78/221, 78/226, 78/228, 78/236, 85/201, 85/202, and 254/170
C-14: 13, 14, 10000, 10001, 10026, 10030, 10036, 10037, 10039, 10044, 10045, 10048, 10050, 10051, 10053, 10056, 10057, 20020, 40000, 40001, 60000, 60001, 60012, 60022, 60023, 60024, 60025, 60026, and 60029

SimonElliottEUM commented 3 years ago

Table A seems fine. In Table B we need to address 0-15-028 and 0-15-054

SimonElliottEUM commented 3 years ago

In the code and flag tables the following need attention:
0-01-036: 250001, 250002, 540001, and 724001
0-01-101: 112 and 418
0-02-033: 1 and 2
0-08-037: 0
0-19-109: 0, 1, 2, ..., 8, and 9
0-20-003: 230
0-20-063: 35, 36, 37, 38, 61, 62, 63, 64, 65, 66, 67, and 87
0-20-089: 2 to 88 (there is an ellipsis)
0-20-136: 14, 15, 16, 17, 18, 19, and 23
0-21-069: 1, 2, 3, 4, 5, 6, 7, and 8
0-33-071: 2 (invisible character between 25 and DU)
0-40-055: 16 and 17

SimonElliottEUM commented 3 years ago

@DavidBerryNOC @SibylleK @richardweedon @jbathegit @amilan17 I hope the above comments give us a starting point. I would be happy to go through the items in a dedicated meeting at some point - it might be a good idea.

jbathegit commented 3 years ago

A dedicated meeting may indeed be a good idea, if nothing else to make sure we're all on the same page and we can also then divide up the work tasks among ourselves.

In the example case of 0-01-013, it and many other descriptors with similar units (e.g. see Class 11) currently use "m/s". We could continue that sort of notation for all descriptors, by always using a forward slash to delineate between numerator and denominator; however, in my opinion that doesn't extend well to units with multiple powers, or with multiple units in either the numerator or denominator. I also think we need a space between units, e.g. use "m s-1" instead of "ms-1", because otherwise things can get a bit ambiguous when you have unit names with multiple letters such as "kg", and in which case you could end up with something like "kgm-2s-1" instead of the more readable "kg m-2 s-1".

So I think the best way forward would be to switch everything to explicit exponents, in the form of either "m s-1" or "m s**-1" or "m s^-1". My personal preference would be to use the notation with "**" to denote exponents, since that agrees with how it is coded in many computer languages. We have 192 bits (= 24 bytes) to work with in the 0-00-015 descriptor, so I think we should have plenty of room.
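
For what it's worth, a quick way to sanity-check candidate unit strings against both constraints (CCITT-IA5 only, and within the 24 characters of 0-00-015) could look like this (a Python 3 sketch; the candidates are illustrative):

def fits_0_00_015(units: str) -> bool:
    # Must be representable in CCITT IA5 (7-bit) and fit in 24 characters.
    return units.isascii() and len(units) <= 24

for candidate in ("m s**-1", "kg m**-2 s**-1", "m s\u207b\u00b9"):
    print(f"{candidate!r}: {fits_0_00_015(candidate)}")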

SimonElliottEUM commented 3 years ago

@jbathegit Thanks for the input - a quick first response concerning 0-01-013: in my code table (PDF from WMO) it is shown with units of "m s" with a "-1" superscript immediately after the s. There is no "/". It is the superscript which cannot be reflected in ASCII, giving "m s-1". I expect the space between "m" and "s" is to disambiguate the cases of meters per second and milliseconds.

We also have subscripts (e.g. in the names of 0-13-099) and fractions inside the superscripts (e.g. units of 0-11-075). The use of subscripts to show the base used for logarithms may not be an issue if we compare the names of 0-15-009 and 0-15-011, but elsewhere we need to consider them (e.g. 0-41-001).

jbathegit commented 3 years ago

@SimonElliottEUM I hadn't looked at the PDF (in fact, I don't even know what URL the PDF lives at anymore? ;-); rather, I was looking at the csv files, because those are the basis for the machine-readable codes, which in turn is what I thought we were supposed to be focusing on here. The csv files are in GitHub at https://github.com/wmo-im/BUFR4, and the XML machine-readable files are at https://community.wmo.int/activity-areas/wmo-codes/manual-codes/latest-version. In both of those locations, 0-01-013 and many other descriptors are listed with "m/s" units.

Either way, I agree that superscripts are the real issue, and that's why I was trying to suggest a way that we could represent these in CCITT-IA5, but in a way that's easily extensible and still readable and unambiguous for units with multiple powers and/or multiple letters in a unit name.

You raise a good point about subscripts and fractions, as those indeed will also need to be taken into account. Maybe for the units of 0-11-075 we could use "m**(2/3) s**-1"? For 0-41-001, the best we may be able to do is just "pCO2", unless someone else has a better idea(?)

SimonElliottEUM commented 3 years ago

@jbathegit I had understood that the PDF version is The Version of the Manual on Codes, and that in the event of any divergence, that the PDF version takes precedence (from the website showing the Latest Version of the Machine Readable Codes for Manual on Codes, Volume I.2, note 4: "If there are inconsistencies between the same entries available below and the Manual on Codes, Volume I.2 (WMO-No. 306), the Manual prevails"). The PDF Manual is available here https://library.wmo.int/doc_num.php?explnum_id=10310

jbathegit commented 3 years ago

@SimonElliottEUM the PDF version is indeed the official arbiter for any questions about the BUFR (or GRIB) regulations, or for cases where there is any disagreement on the attributes of a table entry that is listed in both locations (i.e. in both the PDF and the machine-readable files).

However, and that said, the tables in the PDF aren't updated every time we have a fast-track version update; for example, they don't contain any of the new BUFR Table B and Table D entries from FT-2020-2 (version 35) last November. So they're not always completely up-to-date, and in such cases the machine-readable (i.e. the csv and xml) versions are all we have to go by if we want to have the most thorough and up-to-date inventory of all possible descriptors.

SimonElliottEUM commented 3 years ago

@jbathegit we are in agreement here. And I think we also agree that notes and footnotes (plus entire Table C and Table D) are out of scope.

efucile commented 3 years ago

@wmo-im/tt-tdcf there is work on units being done by @wmo-im/tt-wigosmd that may be relevant for this issue. @amilan17 can you please add here the link to the issue on units that is being discussed by TT-WIGOSMD?

SimonElliottEUM commented 3 years ago

@amilan17 @efucile @jbathegit our work here is not just the units but also the names. In terms of using CCITT-IA5 for class 0, we need to handle names and units in the same way (see 0-00-013 and 0-00-014)

david-i-berry commented 3 years ago

I'm just catching up with the conversation between calls. I've been able to detect two more issues in addition to the ones flagged by Simon, using e.g.

pcregrep --color='auto' -n "[^[:ascii:]]" TableB_en.txt

on the different tables.

C-1 / C-11: 155 (San José)
C-14: 60021 (listed as 60012 in Simon's post)

I think using "**" is clearer for noting powers with fractions in brackets. pCO2 is fine, I think, and is what we currently have; this is also related to the issues on C-14 and how the chemical constituents are represented.
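
For anyone without pcregrep, a rough Python 3 equivalent of the scan above (a sketch; the default filename is just an example) that also prints the offending characters:

import sys

def scan(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            bad = sorted({ch for ch in line if ord(ch) > 127})
            if bad:
                print(f"{path}:{lineno}: {' '.join(bad)}")

scan(sys.argv[1] if len(sys.argv) > 1 else "BUFRCREX_TableB_en_11.csv")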

jbathegit commented 3 years ago

Hi @amilan17, would it be possible for you to create an issue55 branch off of master for this, so we can start going in and making the necessary edits to the csv files? My hope is that we can get this done and included in the upcoming FT-2021-2.

For the rest of you on this thread, if any of you are willing to help make these edits, please let me know and we can divvy up the work. As noted earlier in the thread, I'm also open to the idea of a meeting, but it may also be possible to coordinate this effort via this GitHub issue, depending on how many folks are willing to participate. If we have a dedicated branch, then we can always work in parallel and stack our individual updates as separate commits on the new branch, then when we're done we can submit a PR for Anna or Enrico to pull them all back in to the master branch.

amilan17 commented 3 years ago

@jbathegit

https://github.com/wmo-im/BUFR4/tree/issue55

SimonElliottEUM commented 3 years ago

@jbathegit - I would support a short meeting just to share out the work and agree the preferred approach for the most prevalent issues (basically subscripts and superscripts)

jbathegit commented 3 years ago

@DavidBerryNOC @SibylleK @SimonElliottEUM @richardweedon

OK, I set up a Doodle poll at https://doodle.com/poll/spz3dpxstbt7qy9m?utm_source=poll&utm_medium=link to see if we could find a time for all of us to get together during the next couple of weeks for a short coordination meeting. Per the announced schedule during the last TT-TDCF meeting, we'll need to get this task completed by June 18th if we want any chance of getting it included in FT-2021-2.

Once we have an agreed time, I can set up a Google Meet if that will work for everyone. Please let me know if for some reason you cannot use Google Meet.

jbathegit commented 3 years ago

Thanks @amilan17 for creating an issue branch for us to work with!

jbathegit commented 3 years ago

A discussion meeting took place today between @jbathegit, @SimonElliottEUM, @SibylleK, @DavidBerryNOC and @lemkhenter. Going forward, we agreed to update all units in the CSV files to be consistent with what is currently listed in the "Abbreviation in IA5/ASCII" column of Common Code Table C-6, even if those entries themselves aren't always internally consistent with each other. We then divided up the work, and each of us will make our own separate commits to the respective CSV files, which can then all be eventually merged from the issue55 branch to the master BUFR4 branch via a single pull request.

@sebvi, when we last discussed this issue during the April TT-TDCF sessions, I recall you mentioning that you could provide a reference to a standard way to represent characters such as "é", "ó", "ü", "ä", "ã", "ñ", etc. in CCITTIA5. Could you please share that list with us, for use in updating the C-1, C-11 and C-12 entries?

@amilan17, when you get a chance, we will also need an issue55 branch off of the master CCT branch as well, for the corresponding changes that we'll need to make to the Common Code Table CSV files.

Thanks everyone!

joergklausen commented 3 years ago

Please consult https://github.com/wmo-im/wmds/issues/159 where this is also addressed. I believe the 'notations' used there all comply with IA5/ASCII.

lemkhenter commented 3 years ago

For the names of places (countries, cities and others), I agree with @sebvi: a column with the local spelling including accents, plus an uppercase column without accents (or we use the English names for these places instead of the local names, in the same way that we don't write Tokyo in Japanese or Casablanca in Arabic).

jbathegit commented 3 years ago

Good question @DavidBerryNOC! I think that longer meanings (longer than 62 bytes) could be dealt with by using the 2-08-YYY operator from Table C in front of 0-00-025 (or 0-00-027). That would allow the encoding of any meaning string up to 255 bytes long within a BUFR message where/when needed.

The bigger issue in my mind is still the UTF-8 characters within many of these strings. We definitely need CCITTIA5 equivalents for use when encoding using 0-00-025 (or 0-00-027).

david-i-berry commented 3 years ago

Good question @DavidBerryNOC! I think that longer meanings (longer than 62 bytes) could be dealt with by using the 2-08-YYY operator from Table C in front of 0-00-025 (or 0-00-027). That would allow the encoding of any meaning string up to 255 bytes long within a BUFR message where/when needed.

The bigger issue in my mind is still the UTF-8 characters within many of these strings. We definitely need CCITTIA5 equivalents for use when encoding using 0-00-025 (or 0-00-027).

Thanks @jbathegit, I realised this just after posting and so deleted the message almost straight away (but not quickly enough :-)). For others reading this: I noted the width limit for 0-00-025 (62 characters), forgetting about the operators, and queried whether it was an issue (it probably isn't, but there are a couple of entries with > 255 characters).

david-i-berry commented 3 years ago

The bigger issue in my mind is still the UTF-8 characters within many of these strings. We definitely need CCITTIA5 equivalents for use when encoding using 0-00-025 (or 0-00-027).

I guess the question here is whether we need CCITT IA5 equivalents, or equivalents that use only single-byte printable characters? I think what we want is the latter (single-byte printable characters), as diacritics can be used in CCITT IA5, based on the text in the Manual on the GTS (WMO 386) and the extract from the CCITT blue book (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.50-198811-S!!PDF-E&type=items). To use this we would need to add some extra text to the Manual on Codes specifying the values for the diacritic marks. For example, we would need some words stating that the following are used when using the CCITT IA5 character set:

Graphic | Name | Coded representation
´ | acute accent | 4/0
` | grave accent | 5/11
¨ | umlaut | 5/14

etc., using the notation given in WMO 386. The accented character can then be represented by the character to be accented, the backspace character, and then the accent (or other) mark. The downside is that this requires 3 bytes per accented character and the use of a non-printing character (backspace). The extended ASCII character set / latin-1 (ISO 8859-1) includes single-byte characters with diacritics; I don't know whether this would be an option for the CSV files, or whether it would still cause the problems with automated string searches and other processing that started this issue.

yg31 commented 3 years ago

I'm not sure that replacing a standard multi-byte representation like utf-8 with another, non-standard multi-byte representation would be much better. Accented letters in French text are a recurrent problem with computer systems unless we use standards like utf-8 or html entities. Omitting accents can be an alternative (not ideal, but it is common practice when writing in capital letters, e.g. LA REUNION: accented capitals are not easy to type), and there are other options such as adding an e after a letter to replace an umlaut (for example, ä becomes ae). In the same vein, replacing a character like Œ with the two-character sequence OE usually doesn't introduce much trouble.
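
As an illustration of that kind of fallback (a Python 3 sketch only; the mapping below is an example, not an agreed WMO transliteration table):

FALLBACK = {
    "é": "e", "è": "e", "ê": "e", "ó": "o", "ñ": "n", "ã": "a",
    "ä": "ae", "ö": "oe", "ü": "ue",   # umlaut -> add a trailing e
    "Œ": "OE", "œ": "oe",
}

def to_ia5(text: str) -> str:
    # Map known characters; leave anything unknown untouched for manual review.
    return "".join(FALLBACK.get(ch, ch) for ch in text)

print(to_ia5("La Réunion"), "|", to_ia5("Norrköping"))
# La Reunion | Norrkoeping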

SimonElliottEUM commented 3 years ago

I have made the necessary changes directly in the branch to the following code and flag tables: 0-20-003, 0-20-063, 0-20-089, 0-20-136, 0-21-069, 0-33-071 and 0-40-055. @SibylleK I stole 0-20-003 from you by mistake when I was doing Class 20. I'm sorry.

jbathegit commented 3 years ago

I've made the necessary updates to C-5 (in the CCT issue branch) and to Table B Class 15 for 0-15-028 and 0-15-054 (in the BUFR4 issue branch).

@SimonElliottEUM you had mentioned earlier that entry 73/173 in C-2 also needed attention, but I don't see anything there that looks like it needs to be fixed, unless I missed something(?) At least in looking at the C-2 csv file, everything looks fine to me. The same entry in the PDF also looks fine to me.

jbathegit commented 3 years ago

@amilan17 would you like me to just go ahead and manually add a new column to each of the C-1, C-11 and C-12 csv files, and maybe just give it a name with a _ia5 extension? For example, in C-11 we currently have OriginatingGeneratingCentre_en as the name of the column containing the diacritic characters, so maybe we could call the corresponding new column which has all of these characters replaced with printable CCITTIA5 characters OriginatingGeneratingCentre_en_ia5?

Please let me know how you want me to proceed with this - thanks!

SimonElliottEUM commented 3 years ago

@SimonElliottEUM you had mentioned earlier that entry 73/173 in C-2 also needed attention, but I don't see anything there that looks like it needs to be fixed, unless I missed something(?) At least in looking at the C-2 csv file, everything looks fine to me. The same entry in the PDF also looks fine to me.

@jbathegit I no longer see any issue with 73/173 in C-2. I checked various other incarnations of the table, but they are all fine too. No further action needed for this case.

amilan17 commented 3 years ago

@jbathegit -- yes, please go ahead and create new columns. you can drop "_en" from the header names. Let me know if you have any questions.

jbathegit commented 3 years ago

I've made all the updates to C-1, C-11, and C-12, including the additional new _ia5 column in each one as discussed.

So I believe what still remains to be updated between now and June 24th are the following:

BUFR4 Code tables (@SibylleK)
0-01-036: 250001, 250002, 540001, and 724001
0-01-101: 112 and 418
0-02-033: 1 and 2
0-08-037: 0
0-19-109: 0, 1, 2, ..., 8, and 9

CCT C-14 (@DavidBerryNOC)
13, 14, 10000, 10001, 10026, 10030, 10036, 10037, 10039, 10044, 10045, 10048, 10050, 10051, 10053, 10056, 10057, 20020, 40000, 40001, 60000, 60001, 60021, 60022, 60023, 60024, 60025, 60026, and 60029

david-i-berry commented 3 years ago

I'll go through CCT C-14 this week; most of the entries appear to be radicals. It looks like the way to represent this in CCITT IA-5 is to replace the middle dot with a period character. @sebvi I recall (possibly incorrectly) that you had commented on this table before; if so, do you know whether this would be the correct way to represent the radicals?

SibylleK commented 3 years ago

I made the changes to 0-02-033, 0-08-037 and 0-19-109, and also changed the unit to 0/00 for Table B descriptor 0-15-028. But I struggle a little bit with the changes in 0-01-036 (Agency in charge) and 0-01-101 (State identifier), since for C-1 etc. it was agreed to add another column. If we do the same for the code tables, we would have to add an additional column to every code table for consistency. Therefore I think we should either keep the non-CCITTIA5 characters for these 6 entries or change them in place without adding a column.

@SimonElliottEUM offered to make these changes if desired. Thank you very much!

jbathegit commented 3 years ago

I don't think we need to necessarily be concerned that all code tables have the exact same number of fields in the machine-readable files (csv and xml). If some such tables contain an additional _ia5 column (in csv) or tag (in xml), then any automated parser should be able to adjust without too much trouble and without breaking altogether. Otherwise we cannot even make similar updates to C-1 and C-11, since those are also code tables for 0-01-031, 0-01-033 and 0-01-035.

That said, if the rest of you are concerned about this(?), then I'm fine with just changing the meaning strings of those 6 entries in place (i.e. to remove the non-CCITTIA5 content) without adding an extra column. The important thing is that we need to have at least one meaning field for each code table that contains only CCITTIA5 characters, in order to preserve compatibility with Class 0 of Table B.
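
To illustrate the point about parsers adjusting, a tolerant reader could simply prefer the _ia5 column when it exists and fall back otherwise (a Python 3 sketch; the column and file names follow the C-11 example discussed above):

import csv

def read_meanings(path: str, column: str = "OriginatingGeneratingCentre"):
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            # Prefer the CCITT-IA5 column if present, otherwise the original one.
            yield row.get(column + "_ia5") or row.get(column)

for meaning in read_meanings("C11.csv"):
    print(meaning)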

On a related note, @DavidBerryNOC how are you coming along with the C-14 updates?

lemkhenter commented 3 years ago

I agree with @jbathegit, and I propose for tables C-1 and C-11 either to replace each non-CCITT-IA5 character with the closest CCITT-IA5 character, or to transform the whole column into uppercase, knowing that most languages drop accents on uppercase characters.

jbathegit commented 3 years ago

For C-1 and C-11 I've already added in a separate _ia5 field, as we agreed earlier in this thread. So there are now two meaning columns in the csv files (i.e. OriginatingGeneratingCentre and OriginatingGeneratingCentre_ia5) as well as two corresponding meaning tags in the corresponding xml files for those tables.

My personal opinion is that we should be able to do likewise for regular code/flag tables, but if the rest of you feel otherwise, then I'm fine with just changing the meaning strings of those 6 particular entries that @SibylleK was asking about.

SimonElliottEUM commented 3 years ago

I am in favour of keeping things as simple as possible, and of replacing the non-CCITTIA5 character with the closest CCITTIA5 character. No need for an extra column for the meaning here. As @SibylleK is away until 7 July (nice) I will implement whatever we agree here - just let me know.

david-i-berry commented 3 years ago

I've finished editing C-14, there are two entries that I struggled with and haven't changed:

40000 | Singlet sigma oxygen (dioxygen (sigma singlet)) | O2(1Σ+g) | Operational
40001 | Singlet delta oxygen (dioxygen (delta singlet)) | O2(1Δg) | Operational

I don't know the context of the greek letters well enough and am not sure whether replacing them would change the meaning.

The other changes I've made are to:

13, 14, 10000, 10001, 10026, 10030, 10036, 10037, 10039, 10044, 10045, 10048, 10050, 10051, 10053, 10056, 10057, 20020, 60000, 60001, 60021, 60022, 60023, 60024, 60025, 60026, and 60029

I've replaced the radical middle dot with a period, single angled quote with a straight quote (and wrapped field in double quotes) and replaced the symbol alpha with the word alpha.

david-i-berry commented 3 years ago

On a bit more thinking, it might be easiest to replace the Greek characters with HTML entity encoding, e.g. &Sigma; for upper case sigma (Σ), etc.
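
A sketch of that idea in Python 3 (codepoint2name covers the named HTML entities, with a numeric character reference as a fallback):

from html.entities import codepoint2name

def to_entities(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) > 127:
            name = codepoint2name.get(ord(ch))
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)

print(to_entities("O2(1\u03a3+g)"))  # O2(1&Sigma;+g)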

jbathegit commented 3 years ago

@DavidBerryNOC that seems like a reasonable approach, since & and ; are in the CCITTIA5 character set.

@SimonElliottEUM I believe it's fine to just replace the non-CCITT5 character with the closest CCITT5 character for those 6 particular code table entries, as previously suggested by @lemkhenter.

Thanks everyone for your contributions towards resolving this issue!

sebvi commented 3 years ago

I've finished editing C-14, there are two entries that I struggled with and haven't changed:

40000 | Singlet sigma oxygen (dioxygen (sigma singlet)) | O2(1Σ+g) | Operational
40001 | Singlet delta oxygen (dioxygen (delta singlet)) | O2(1Δg) | Operational

I don't know the context of the greek letters well enough and am not sure whether replacing them would change the meaning.

The other changes I've made are to:

13, 14, 10000, 10001, 10026, 10030, 10036, 10037, 10039, 10044, 10045, 10048, 10050, 10051, 10053, 10056, 10057, 20020, 60000, 60001, 60021, 60022, 60023, 60024, 60025, 60026, and 60029

I've replaced the radical middle dot with a period, single angled quote with a straight quote (and wrapped field in double quotes) and replaced the symbol alpha with the word alpha.

@DavidBerryNOC I think it is not correct to change the "radical" middle dot to a period ".". If "*" is part of the correct set, then it is a much, much better way to represent the unpaired electron of the radical.

The notations with Greek letters are molecular electronic states. Without going into details, they tell you which molecular orbitals the electrons are in, plus information on spin and symmetry (see https://en.wikipedia.org/wiki/Molecular_term_symbol).

david-i-berry commented 3 years ago

@sebvi thanks, I've gone through and changed the periods to asterisks. I've left the Greek letters in their HTML encoding.