wolfmanstout / gaze-ocr

Easily apply OCR to wherever the user is looking onscreen.
Apache License 2.0

Issues running the grammar through dragonfly CLI #1

Closed LexiconCode closed 4 years ago

LexiconCode commented 4 years ago

I'm testing this out on Python 3.8.2 32-bit.

I've tried several different ways to run this. The 2nd and 3rd methods do not function reliably. Is there anything further I can do to help troubleshoot?

  1. Running the grammar through Natlink via the traditional in-process method (_gaze-ocr.py) works correctly.

  2. `python -m dragonfly load --engine natlink gaze-ocr.py --no-recobs-messages` (in-process method). The OCR seems to be working correctly: it is marked as a success and does indeed have a correct gaze location. However, no text is highlighted. ocr_data.zip

  3. `python -m dragonfly load --engine natlink gaze-ocr.py --no-recobs-messages` (out-of-process method; grammar runs on its own thread).

Now, what's interesting is that several different behaviors appear here. For reference, here is the grammar:

import threading, time
import gaze_ocr
import screen_ocr  # dependency of gaze-ocr

from dragonfly import (
    Dictation,
    Grammar,
    Key,
    MappingRule,
    Mouse,
    Text,
    get_engine
)

# See installation instructions:
# https://github.com/wolfmanstout/gaze-ocr
DLL_DIRECTORY = r"C:\Users\Main\Desktop\ocr_data\dll"

# Initialize eye tracking and OCR.
tracker = gaze_ocr.eye_tracking.EyeTracker.get_connected_instance(DLL_DIRECTORY)
ocr_reader = screen_ocr.Reader.create_fast_reader()
gaze_ocr_controller = gaze_ocr.Controller(ocr_reader, tracker, save_data_directory=r"C:\Users\Main\Desktop\ocr_data")

class CommandRule(MappingRule):
    mapping = {
        # Click on text.
        "<text> click": gaze_ocr_controller.move_cursor_to_word_action("%(text)s") + Mouse("left"),

        # Move the cursor for text editing.
        "go before <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "before") + Mouse("left"),
        "go after <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "after") + Mouse("left"),

        # Select text starting from the current position.
        "words before <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "before") + Key("shift:down") + Mouse("left") + Key("shift:up"),
        "words after <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "after") + Key("shift:down") + Mouse("left") + Key("shift:up"),

        # Select a phrase or range of text.
        "words <text> [through <text2>]": gaze_ocr_controller.select_text_action("%(text)s", "%(text2)s"),

        # Select and replace text.
        "replace <text> with <replacement>": gaze_ocr_controller.select_text_action("%(text)s") + Text("%(replacement)s"),
    }

    extras = [
        Dictation("text"),
        Dictation("text2"),
        Dictation("replacement"),
    ]

    def _process_begin(self):
        # Start OCR now so that results are ready when the command completes.
        gaze_ocr_controller.start_reading_nearby()

grammar = Grammar("ocr_test")
grammar.add_rule(CommandRule())
grammar.load()

# Force NatLink to schedule background threads frequently by regularly waking up
# a dummy thread.
shutdown_dummy_thread_event = threading.Event()
def run_dummy_thread():
    while not shutdown_dummy_thread_event.is_set():
        time.sleep(1)

dummy_thread = threading.Thread(target=run_dummy_thread)
dummy_thread.start()

# Initialize a Dragonfly timer to manually yield control to the thread.
def wake_dummy_thread():
    dummy_thread.join(0.002)

wake_dummy_thread_timer = get_engine().create_timer(wake_dummy_thread, 0.02)

def unload():
    # ... after unloading the grammar ...
    shutdown_dummy_thread_event.set()
    dummy_thread.join()

wolfmanstout commented 4 years ago

Thanks for the detailed report! I didn't even know about the existence of these alternative ways of invoking Python ... can you point me to where I can learn more about those? Sounds like a really nice way to avoid having to completely restart Dragon in order to reinitialize Python ...

Since I don't know exactly how those work, I'll instead share some generic information about how my system works, in case it sparks any debugging ideas.

LexiconCode commented 4 years ago

can you point me to where I can learn more about those? Sounds like a really nice way to avoid having to completely restart Dragon in order to reinitialize Python ...

Not having to restart Dragon is really nice. See the Dragonfly CLI documentation. Behind the scenes it uses natlink.waitForSpeech() to run the grammars out of process on their own thread. Another way to achieve this besides the CLI is to use Dragonfly's engine loaders, such as dfly-loader-natlink.py. Neither of these methods launches the "messages window". It's not the official way to run Natlink, but as you can see it has some significant advantages.
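
For reference, a minimal out-of-process loader along those lines might look like the sketch below. This is only a sketch: natConnect/waitForSpeech/natDisconnect are the natlink calls named above, but the grammar module name is illustrative.

import natlink

# Sketch of an out-of-process loader in the spirit of dfly-loader-natlink.py,
# run from an ordinary Python console while Dragon is running.
natlink.natConnect(1)  # connect to Dragon; 1 allows other threads to run
try:
    import _gaze_ocr_grammar  # hypothetical module that builds and loads the grammar
    natlink.waitForSpeech()   # opens the waitForSpeech window and pumps recognition
finally:
    natlink.natDisconnect()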

The only issue I've noticed running out of process is that importing win32ui freezes Natlink. The freeze manifests as an absence of recognition in Dragon, and Natlink's waitForSpeech GUI freezes. Ultimately Natlink's process has to be ended using the Task Manager; after the process is terminated, DNS resumes recognition.

There might be a simpler way to implement your workaround for the threading issues with Natlink. Dane included a helper function to simplify managing threads. See here for documentation and source.

I will get back to you with further debugging.

LexiconCode commented 4 years ago
* What is the typical failure mode for the second and third methods you described? Is it like the video you sent me separately, where the cursor does move ... but to the wrong place?

The cursor placement is off, and so is the caret placement with the click and select commands. This by and large seems to be the main mode of failure, and it is 100% repeatable on my end. It turns out that with both the click and select failures, the behavior is the same as demonstrated in the video. For example, take the target phrase "susceptible" for both the click and select commands.

Start<caret start> 
This is a test
Being human makes us susceptible words ceptib1le to the onset of feelings. 
This is a test

<caret end>

For the utterance "words susceptible" (I took care to hold my gaze, with a significant pause before and at the end of the utterance):

Gaze point: (1618.846591, 279.274395)
Mouse Move[1618, 254]
Mouse Move[1718, 254]

For the utterance "susceptible click":

Gaze point: (1661.408367, 272.171501)
Mouse Move[1672, 254]

I suspected in the example above that there wasn't enough text to be highlighted or clicked if the offset was too large relative to the given body of text. Here's another way to visualize the issue: below is an action selecting text with the target word "susceptible".

[screenshot of the resulting selection]

So there seems to be a difference in the number of characters between the target word and the selected characters. Just another data point that might be helpful:

susceptible is 11 characters long

4444444444444 is 13 characters long

The offset, measured in lines below the target word, grows with each utterance of the selection command. See the video.

"susceptible click", using the same methodology as the video above, with 3 utterances of the command:

Gaze point: (1551.106345, 344.777151)
Mouse Move[1551, 353]

Gaze point: (1599.236974, 538.596187)
Mouse Move[1552, 573]

Gaze point: (1522.033729, 736.742983)
Mouse Move[1551, 793]

Functions modified to print the coordinates:

# Debug print added to dragonfly's Mouse action:
class Mouse(object):
    def move(self, coordinates):
        print("Mouse Move[{}, {}]".format(*coordinates))

# Debug print added to the eye tracker in gaze_ocr:
def get_gaze_point_or_default(self):
    if self.has_gaze_point():
        print("Gaze point: (%f, %f)" % self._gaze_point[:2])
LexiconCode commented 4 years ago

I wanted to separate Natlink out of the speech recognition stack, which should simplify the stack to just Dragonfly and your code. I did this using Dragonfly's text engine, which uses mimic to emulate spoken words as if they were recognized as a spoken utterance.
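
For reference, the same thing can be driven directly from Python. A minimal sketch, with the grammar loading elided:

from dragonfly import get_engine

# The text engine needs no speech recognizer; mimic() emulates an utterance.
engine = get_engine("text")
engine.connect()
# ... load the grammar here, as in the script above ...
engine.mimic("words susceptible")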

Running the text engine from the Dragonfly CLI for _gaze-ocr.py (your grammar):

python -m dragonfly test _gaze-ocr.py --delay 3

Type commands to emulate as if they are being dictated by voice.
Lowercase mimics `commands`; UPPERCASE mimics `free dictation`.
Upper and lowercase words can be mixed, e.g. `say THIS IS A TEST`.

Edit the `--delay 3` in the bat file to change the command delay in seconds.
The delay allows the user to switch to the relevant application to test commands.
INFO:module:CommandModule('_gaze-ocr.py'): Loading module: 'C:\Users\Main\Desktop\_gaze-ocr.py'
Eye tracker connected.
INFO:command:Calls to mimic() will be delayed by 3.00 seconds as specified
INFO:command:Enter commands to mimic followed by new lines.
words SUSCEPTIBLE
Gaze point: (1757.592047, 287.165824)
mouse move[1686, 295]
mouse move[1793, 295]
INFO:command:Mimic success for words: words SUSCEPTIBLE
words SUSCEPTIBLE
Gaze point: (1775.471045, 473.689781)
mouse move[1686, 515]
mouse move[1794, 515]
INFO:command:Mimic success for words: words SUSCEPTIBLE
words SUSCEPTIBLE
Gaze point: (1768.751352, 726.764728)
mouse move[1686, 735]
mouse move[1793, 735]
INFO:command:Mimic success for words: words SUSCEPTIBLE

`SUSCEPTIBLE click` result:

SUSCEPTIBLE click
Gaze point: (1703.507924, 290.138132)
mouse move[1739, 295]
INFO:command:Mimic success for words: SUSCEPTIBLE click
SUSCEPTIBLE click
Gaze point: (1711.555348, 491.819858)
mouse move[1739, 515]
INFO:command:Mimic success for words: SUSCEPTIBLE click
SUSCEPTIBLE click
Gaze point: (1715.925631, 675.288658)
mouse move[1740, 735]
INFO:command:Mimic success for words: SUSCEPTIBLE click

The behavior is very close to identical. I've also tested with and without the "force NatLink to schedule background threads" code when utilizing the text engine.

wolfmanstout commented 4 years ago

Thanks again for the detailed information. To confirm: everything works fine when running the standard way, but not via the CLI methods?

Based on the coordinates you printed, it looks like the clicked locations are pretty well aligned with the eye tracking locations, which come directly from an API I don't control. That means that both of these agree on the frame of reference, but both are misaligned with the actual screen contents. The question, then, is what is causing this shift. Do you have a second monitor, or perhaps something docked on the screen that could cause this? If this is indeed only happening with the CLI version, I wonder whether the coordinates are anchored to your command prompt window location for some reason? You could try moving that window around and see if that influences the results.

LexiconCode commented 4 years ago

To confirm: everything works fine when running the standard way, but not via the CLI methods?

Yes that's correct.

You could try moving the location of that window around and see if that influences the results.

Do you have a second monitor, or perhaps something docked on the screen that could cause this?

If you can't replicate it yourself, you're more than welcome to have access to the machine to get first-hand experience. We could arrange a time and a remote access method over Gitter.

wolfmanstout commented 4 years ago

I have a theory: perhaps the Mouse action is misbehaving in this configuration and clicking on a location relative to the active window. That seems more likely than two different APIs both misbehaving in the same way. Can you test some simple Mouse actions, such as [0, 0], using this config? That should be the top-left corner of the screen. If that doesn't reveal the issue, then it'd help to have more details on what happens when you move the foreground window.
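
Something like this from a Python console would do it (a sketch using dragonfly's Mouse spec; the fractional form is a bonus check):

from dragonfly import Mouse

Mouse("[0, 0]").execute()        # absolute pixel position: top-left corner
Mouse("[0.5, 0.5]").execute()    # floats in [0, 1] are fractions of the screen
Mouse("[0, 0], left").execute()  # move, then click the left button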

Also happy to debug remotely this weekend and/or try to repro your setup.

LexiconCode commented 4 years ago

Mouse actions such as [0, 0] using this config? That should be the top left corner of the screen.

Mouse("[0, 0]").execute() works as expected Top Right Top Right Bottom Right Bottom Right Top Left Top Left Bottom Left Bottom Left

wolfmanstout commented 4 years ago

I was able to reproduce this and figure out what was going on, and it was far more confusing than I ever could have guessed! The cause of the problem is Windows text scaling: if you set this to 100% you won't see any issues. As it turns out, this is broken in both method 1 and method 2, but in different ways for each! (I didn't bother testing method 3.) Here's what's going on:

Method #1: The eyetracker is returning scaled-down coordinates, the screenshot is at full scale, and the Mouse action works as you would expect (no adjustment to the coordinates it is given). Because these last two are synchronized, this mostly works ... unless you position the window towards the bottom left of the screen, in which case the cropped region of the screenshot (which is based on the eyetracker) will be offset enough that it won't include the text you want to select.

Method #2: The eyetracker is returning scaled-down coordinates, the screenshot is at full scale, and the Mouse action scales up absolute coordinates it is given in proportion to text scaling. Hence, this has the same failure mode as method 1, but in addition it will also incorrectly position the cursor due to the scaling up of the coordinates it is given.

So, the test with [0, 0] I suggested earlier didn't reveal anything because that's the one coordinate that's not affected by scaling!

I was able to fix this for #1 by adjusting the eyetracker scaling by comparing the screen bounds it reports with what comes from the dragonfly Monitor class. That's now checked into my repository (not yet pushed out -- I'm holding off for a larger release). #2 is still broken, however, for two reasons: (1) the Monitor class returns the wrong size so the eye tracker coordinates are not scaled correctly and (2) the Mouse action still behaves inconsistently with #1. Also, even for use case #1, all bets are off if you adjust scaling after starting everything up. The behavior is extremely bizarre: the desktop resolution from Monitor gets reported incorrectly after adjusting the scale! After seeing all this, I now have full sympathy for any application on Windows that doesn't work properly with text scaling, sadly...
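
To illustrate the kind of adjustment described for #1, here is a sketch only (not the actual commit); it assumes the tracker reports the screen bounds it works in, and that dragonfly exposes the primary monitor via monitors[0]:

from dragonfly import monitors

def scale_gaze_point(gaze_x, gaze_y, tracker_width, tracker_height):
    # Rescale eyetracker coordinates into desktop pixels by comparing the
    # bounds the tracker reports against dragonfly's Monitor bounds.
    desktop = monitors[0].rectangle
    return (gaze_x * desktop.dx / tracker_width,
            gaze_y * desktop.dy / tracker_height)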

In other news, thank you again for sharing these other methods of loading grammars! What's the advantage of method #3 vs. #2? Even with #2 I was able to pick up changes to Python modules made outside of my grammar. I did, however, notice that some grammars did not work properly in this mode (e.g. saying "number one three" was treated as dictation instead of typing "13", as I've set it up to do).

I removed my threading hack entirely so I can test whether Dane's fixes are enough. Seems to be working fine.

wolfmanstout commented 4 years ago

From reading up some more on how Windows handles high-DPI situations, it looks like the issue is that when Python is run from the command line, Windows treats it as a "high-DPI unaware" application, whereas when it is embedded in Dragon it is treated as "high-DPI aware" (perhaps inheriting the context from Dragon itself). Hence the core Windows APIs will behave differently. I am able to force the command-line case to be DPI aware using the following:

import ctypes
...
ctypes.windll.user32.SetProcessDPIAware()  # mark this process as DPI aware

So far this seems to work fairly well, except it is still broken if you change the scaling after your grammar has been loaded. After I test this for a little while, I will probably submit this as a pull request to be run inside Dragonfly so that the behavior is consistent. Here's a lovely example of just how misleading these APIs are (note that this post is from 2015 and there is still no information on either of these APIs as to how they behave relative to a high-DPI display): https://social.msdn.microsoft.com/Forums/sqlserver/en-US/2dc1648d-a731-49f2-8ae5-d486644a62fb/suggestion-for-setcursorpos-and-high-dpi-displays?forum=windowsgeneraldevelopmentissues

wolfmanstout commented 4 years ago

I think this is about "as fixed as it is going to get" now with my latest change. Here's what I will likely submit as a pull request unless I run into issues: https://github.com/dictation-toolbox/dragonfly/compare/master...wolfmanstout:dpi_awareness?expand=1

There are multiple modes of DPI awareness, and this is the "most aware". This means that when running from the command line (method 2), you can even change the DPI multiple times and everything should work properly. It appears that one cannot override DPI awareness when running embedded in Dragon, so in that scenario you are limited to awareness of DPI at startup time only.
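
For context, here is a sketch of requesting the "most aware" mode with fallbacks for older Windows versions. The constants come from the Windows headers; the exact fallback logic is illustrative, not the actual pull request:

import ctypes

def set_dpi_awareness():
    try:
        # Windows 10 1703+: per-monitor (v2) awareness, tracks DPI changes live.
        # DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2 == -4
        ctypes.windll.user32.SetProcessDpiAwarenessContext(-4)
    except AttributeError:
        try:
            # Windows 8.1+: PROCESS_PER_MONITOR_DPI_AWARE == 2
            ctypes.windll.shcore.SetProcessDpiAwareness(2)
        except (AttributeError, OSError):
            # Vista+: system DPI aware only
            ctypes.windll.user32.SetProcessDPIAware()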

wolfmanstout commented 4 years ago

Fixed with commit 75440ed5e6d18dae571f1f086164b8d239c43f41

wolfmanstout commented 4 years ago

I'll leave this open until the Dragonfly pull request is submitted, since that is required to address your original concern.

LexiconCode commented 4 years ago

I think this is about "as fixed as it is going to get" now with my latest change. Here's what I will likely submit as a pull request unless I run into issues: dictation-toolbox/dragonfly@master...wolfmanstout:dpi_awareness?expand=1 (compare)

There are multiple modes of DPI awareness, and this is the "most aware". This means that when running from the command line (method 2), you can even change the DPI multiple times and everything should work properly. It appears that one cannot override DPI awareness when running embedded in Dragon, so in that scenario you are limited to awareness of DPI at startup time only.

This reminds me of an issue we faced with Mouse Grids in Caster: https://github.com/dictation-toolbox/Caster/issues/172

from ctypes import windll

error_code = windll.shcore.SetProcessDpiAwareness(2)  # enable 1-1 pixel mapping
if error_code == -2147024891:  # 0x80070005, E_ACCESSDENIED: awareness was already set
    raise OSError("Failed to set app awareness")

https://docs.microsoft.com/en-us/windows/win32/api/shellscalingapi/

Perhaps I can use your screen OCR package to reimplement the Legion grid so that it's cross-platform; I'd just need to extract the bounding boxes.
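
As a rough sketch of that idea (create_fast_reader is from your README; the read_nearby call and the result structure are my assumptions about the API, not confirmed):

import screen_ocr

reader = screen_ocr.Reader.create_fast_reader()
# Assumed API: OCR the region near a screen point, then walk the words.
contents = reader.read_nearby((960, 540))
for line in contents.result.lines:   # assumed result structure
    for word in line.words:
        print(word.text, word.left, word.top, word.width, word.height)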

I will test your changes in the next few days.

LexiconCode commented 4 years ago

I did, however, notice that some grammars did not work properly in this mode (e.g. saying "number one three" was treated as dictation instead of typing "13", as I've set it up to do).

That's good to know! So far I have only seen a handful of people who have had issues with the dictation element. See https://github.com/dictation-toolbox/dragonfly/issues/242. The other issue is that DictListRef doesn't seem to work without a process. You asked if there are any advantages to method #3 vs. #2. The primary advantage of #3 is doing away with the Natlink messaging window GUI. In Caster, once Natlink is released with Python 3 support, it allows me to integrate a GUI that is much more advanced. The implementation shows whether utterances are recognized as dictation or commands, and it displays available commands. It would be nice to have it as a standalone that could be used with any grammar framework, but it's integrated with Caster's CCR system.

Long term it would be nice if CCR was integrated into dragonfly.

I've tested your changes with the gaze-ocr and dragonfly libraries, and everything works as expected. Thanks for all your help with this. The issue we resolved here also highlighted and fixed a bug in the Legion MouseGrid, which had the same underlying "high-DPI unaware" issue when run from cmd with the Dragonfly CLI.

wolfmanstout commented 4 years ago

Fixed by https://github.com/dictation-toolbox/dragonfly/pull/305.