npub122c…hv33

2024-01-31 10:46:54

how to fully automate any ui task? there are many methods.

1)
visual ai can locate ui elements and detect which button to press.
it could feed this information to e.g. a web browser automation tool, but it is difficult to get ai models to output the exact strings that logic applications require, e.g.:

prompt: button to send new note?
model: New Notee
oops, there is a typo --> the logic application fails, because it requires an exact match
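
the failure above can be reproduced in a few lines. this is only a sketch: the button table and its selectors are made up for illustration, and the fuzzy lookup (python's standard `difflib`) is one possible mitigation, not something the note itself proposes.

```python
import difflib

# hypothetical table of button labels the automation tool knows about
KNOWN_BUTTONS = {"New Note": "#new-note", "Reply": "#reply", "Search": "#search"}

def exact_lookup(label):
    """Strict lookup: returns None unless the label matches a key exactly."""
    return KNOWN_BUTTONS.get(label)

def fuzzy_lookup(label, cutoff=0.8):
    """Tolerant lookup: maps the label to the closest known label, if one is close enough."""
    matches = difflib.get_close_matches(label, list(KNOWN_BUTTONS), n=1, cutoff=cutoff)
    return KNOWN_BUTTONS[matches[0]] if matches else None

model_output = "New Notee"          # the typo from the example above
print(exact_lookup(model_output))   # strict lookup finds nothing
print(fuzzy_lookup(model_output))   # tolerant lookup recovers the intended button
```

the cutoff is a trade-off: too low and a wrong button gets pressed, too high and small typos break the flow again.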

this limitation can be solved with better, more accurate models,
but there is still the limitation of application interfaces:
is web browser automation enough?

to fully generalize automation tasks, let's think about another solution:
2)

can we get ai to control the input devices?
let's think about controlling the mouse.

an ai model can see where the button it needs to press is, and it can also see where the cursor is.

we could even feed information from the visual model to an object-detection-style model, which would output the coordinates of the ui element. however, all of this is probably solvable with a single model.
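
a rough sketch of the two-stage idea: the visual model names the element, and a detector returns boxes from which a click point is computed. the `Detection` type and the sample boxes are hypothetical, standing in for whatever a real detector would return.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str     # what the detector thinks the element is
    box: tuple     # (left, top, right, bottom) in screen pixels
    score: float   # detector confidence, 0..1

def center_of(label, detections):
    """Return the center (x, y) of the highest-scoring box with this label, or None."""
    candidates = [d for d in detections if d.label == label]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d.score)
    left, top, right, bottom = best.box
    return ((left + right) // 2, (top + bottom) // 2)

# made-up detector output for one screenshot
detections = [
    Detection("new note", (150, 880, 250, 920), 0.91),
    Detection("search", (10, 10, 110, 40), 0.88),
]
print(center_of("new note", detections))
```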

we need to combine a visual model with a model that outputs coordinates
we need to be able to do this:

prompt: "where is the new note button"
response: x: 200, y: 900
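
a response like this still has to be turned into numbers before anything can click on it. a minimal sketch, assuming the model answers in the `x: ..., y: ...` form shown above; the function name and the screen size are illustrative only.

```python
import re

# matches answers shaped like "x: 200, y: 900"
COORD_PATTERN = re.compile(r"x:\s*(-?\d+)\s*,\s*y:\s*(-?\d+)", re.IGNORECASE)

def parse_coordinates(response, screen_size):
    """Extract (x, y) from the model response; None if missing or off-screen."""
    match = COORD_PATTERN.search(response)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    width, height = screen_size
    if not (0 <= x < width and 0 <= y < height):
        return None  # refuse to click outside the screen
    return (x, y)

print(parse_coordinates("x: 200, y: 900", screen_size=(1920, 1080)))
```

the resulting tuple could then be handed to a mouse-control library (for example `pyautogui.click(x, y)`), which is left out here so the sketch runs without a display.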

a visual model that also outputs coordinates can be given any ui automation task for any application

what is still left for a really autonomous agent:
- task flow control: how is the task proceeding?
- goal control: what is the goal we want to accomplish? (generally this comes from the human controlling the model, unless we want to build real agi)
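
the two remaining pieces can be sketched as a simple loop: the goal check comes from the human, and flow control is the observe / decide / act cycle with a step limit. everything here is hypothetical, including the toy "ui" that only simulates screens.

```python
def run_task(goal_reached, choose_action, perform, observe, max_steps=10):
    """Observe, stop if the goal holds, otherwise act; give up after max_steps actions."""
    history = []
    for _ in range(max_steps):
        state = observe()
        if goal_reached(state):            # goal control
            return True, history
        action = choose_action(state, history)  # task flow control
        perform(action)
        history.append(action)
    return goal_reached(observe()), history

# toy stand-in for a real ui: just a screen name that changes on clicks
ui = {"screen": "home"}

def observe():
    return dict(ui)

def goal_reached(state):
    return state["screen"] == "note sent"

def choose_action(state, history):
    return {"home": "click new note", "editor": "click send"}[state["screen"]]

def perform(action):
    ui["screen"] = {"click new note": "editor", "click send": "note sent"}[action]

done, history = run_task(goal_reached, choose_action, perform, observe)
print(done, history)
```

in a real agent, `observe` would be a screenshot, `choose_action` the visual model, and `perform` the mouse or keyboard control.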

Author Public Key

npub122ce3y27qw6c323r38hk8easz5l3uvxqusqvnmsjfv8sga42878ssyhv33

Seen on

wss://relay.nostr.band

Kind type

1 Short Text Note

Event JSON

{ "id": "e6b8f1c1d920ed4b14b5c19e26ddc31daf5dbe24cb9c395e753f9f3475febd61", "pubkey": "52b198915e03b588aa2389ef63e7b0153f1e30c0e400c9ee124b0f0476aa3f8f", "created_at": 1706698014, "kind": 1, "tags": [], "content": "how to fully automatize any ui task? there are many methods.\n\n1)\nvisual ai can locate ui elements, it could detect which button to press\nit could feed this information to eg web browser automation tool, but there are difficulties to get ai models to output perfect information for logic applications -\u003e eg.\n\nprompt: button to send new note?\nmodel: New Notee\noops, there is typo --\u003e logic application fails, because it requires perfect information\n\nalthough this limitation can be solved with better, more accurate models\nstill there is limitation of application interfaces\nis web browser automation enough?\n\nto fully generalize automation tasks, lets think about solution\n2)\n\ncan we get ai to control input methods?\nlets think about controlling mouse.\n\nai model can visually see where is the button it needs to press, then it can see where the cursor is.\n\nwe could even feed information from visual model to object detection style model, which would output coordinates of ui element. however this all is probably solvable with single model.\n\nwe need to combine visual model with coordination model\nwe need to be able to \n\nprompt: \"where is the new note button\"\nresponse: x: 200, y: 900\n\nvisual model with combined coordination can be tasked for any ui automation task for any application\n\nstill what is left for really autonomous agent:\n- task flow control, how is the task proceeding?\n- goal control, what is the goal we want to accomplish? (generally this comes from human controlling the model, unless we want to build real agi)", "sig": "01653face6975836a8dead14741e9249bacfc8af31aab983c8f0479056758a8b9635296ef7de73b903a0342219f405eafd81da9fe1748e2bd2b3b62a046e40e6" }