weewx / website

Assets used for the weewx.com website
5 stars 4 forks source link

Can the station registry accept UTF-8 in a station description? #7

Open tkeffer opened 2 years ago

tkeffer commented 2 years ago

In /var/log/nginx/error.log, I am seeing a lot of errors of the sort

2022/10/01 12:05:29 [error] 940#0: *105063 FastCGI sent in stderr: "DBD::mysql::st execute failed: Incorrect string value: '\xC4\x81zene...'
 for column 'description' at row 1 at /var/www/html/register/register.cgi line 429" while reading response header from upstream, client: 62.
85.109.69, server: weewx.com, request: "GET /register/register.cgi?station_url=https%3A%2F%2Fwww.wunderground.com%2Fpersonal-weather-station
%2Fdashboard%3FID=IRGASPIL3&description=Atg%C4%81zene%2C+R%C4%ABga%2C+Latvija&latitude=56.9172&longitude=24.0822&station_type=WMR200&station
_model=WMR200A&python_info=3.10.6&platform_info=Linux-5.15.0-1015-raspi-aarch64-with-glibc2.35&weewx_info=4.8.0 HTTP/1.1", upstream: "fastcg
i://unix:/var/run/fcgiwrap.socket:", host: "weewx.com"

What they all seem to have in common is the use of UTF-8 in the field description.

Perl is a mystery to me, so I can't tell if it's rejecting UTF-8 characters.

matthewwall commented 2 years ago

i just tried an alter on the description field to make it utf8. need to do more research to see what cause this behavior to start, but i suspect an upgrade to mysql would have done it.

so far i do not see more messages like this in error.log, but we'll see...

matthewwall commented 2 years ago

here is the table before and after, and the alter command.

alter table `stations` change `description` `description` varchar(255) character set utf8mb4 default NULL;
mysql> show create table stations \G
*************************** 1. row ***************************
       Table: stations
Create Table: CREATE TABLE `stations` (
  `station_url` varchar(255) NOT NULL,
  `description` varchar(255) DEFAULT NULL,
  `latitude` float DEFAULT NULL,
  `longitude` float DEFAULT NULL,
  `station_type` varchar(64) DEFAULT NULL,
  `station_model` varchar(128) DEFAULT NULL,
  `weewx_info` varchar(64) DEFAULT NULL,
  `python_info` varchar(64) DEFAULT NULL,
  `platform_info` varchar(128) DEFAULT NULL,
  `last_addr` varchar(16) DEFAULT NULL,
  `last_seen` int DEFAULT NULL,
  UNIQUE KEY `index_url` (`station_url`),
  KEY `index_addr` (`last_addr`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
mysql> show create table stations \G
*************************** 1. row ***************************
       Table: stations
Create Table: CREATE TABLE `stations` (
  `station_url` varchar(255) NOT NULL,
  `description` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `latitude` float DEFAULT NULL,
  `longitude` float DEFAULT NULL,
  `station_type` varchar(64) DEFAULT NULL,
  `station_model` varchar(128) DEFAULT NULL,
  `weewx_info` varchar(64) DEFAULT NULL,
  `python_info` varchar(64) DEFAULT NULL,
  `platform_info` varchar(128) DEFAULT NULL,
  `last_addr` varchar(16) DEFAULT NULL,
  `last_seen` int DEFAULT NULL,
  UNIQUE KEY `index_url` (`station_url`),
  KEY `index_addr` (`last_addr`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
matthewwall commented 2 years ago

actually, that latin1 is kind of obvious. when did that appear? pretty sure it was not part of the original table creation. so perhaps a mysql upgrade added that to make it explicit, thus breaking any inserts that contain utf8?

sorry, but character encodings are one of my many weaknesses

matthewwall commented 2 years ago

the latin1 is there apparently because i did not specify a characterset when i originally created the tables eons ago, so it defaulted to latin1. apparently mysql 8 (running on weewx.com as of this writing nov 2022) defaults to utf8 (not sure whether that is 3-byte or 4-byte).

i'm not crazy about altering everything in the database. description should obviously accept non-latin characters, but should any other fields?

tkeffer commented 2 years ago

Interesting.

Internally, I'm sure MySQL stores it as a byte string. However, it appears that the database engine is also enforcing character type safety. That is, if you declare the field as Latin-1, then try to store something that uses non-Latin-1 characters, it will declare an error.

When it comes time to print the field out, the MySQL client will use the encoding to figure out what code point to use.

I believe that UTF-8 domain names are valid these days, although most DNS servers don't support them. To future proof things, I would use utf-8 encoding for station_url as well.

The other fields should be fine.

It's an ancient article, but I recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. It set me straight now 20 years ago!