how to fully automate any ui task? there are several methods.
1)
visual ai can locate ui elements; it can detect which button to press
it could feed this information to e.g. a web browser automation tool, but it is hard to get ai models to output exact information for logic applications -> e.g.
prompt: button to send new note?
model: New Notee
oops, there is a typo --> the logic application fails, because it requires exact information
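as a rough sketch of that failure mode (assuming Playwright and a hypothetical web client url, not any particular pipeline), the automation layer does an exact text lookup and the one-character typo makes it miss:

# minimal sketch: exact-match lookup on the visual model's answer
from playwright.sync_api import sync_playwright

model_output = "New Notee"  # the model's slightly wrong answer

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-nostr-client.invalid")  # hypothetical client url
    # exact text lookup: the typo means no element matches,
    # so the click times out and the whole run fails
    page.get_by_text(model_output, exact=True).click(timeout=3000)
    browser.close()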
this limitation can be reduced with better, more accurate models
but we are still limited by application interfaces
is web browser automation enough?
to fully generalize automation tasks, let's think about another solution
2)
can we get ai to control input methods?
let's think about controlling the mouse.
the ai model can visually see where the button it needs to press is, and it can see where the cursor is.
we could even feed the output of the visual model into an object-detection-style model that returns the coordinates of the ui element. however, this is probably all solvable with a single model.
we need to combine the visual model with a coordinate model
we need to be able to do
prompt: "where is the new note button"
response: x: 200, y: 900
a visual model combined with coordinate output can be tasked with any ui automation task in any application
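a minimal sketch of that combination, assuming pyautogui for mouse control and a placeholder locate_ui_element() standing in for the visual+coordinate model:

# sketch: a hypothetical grounding call turns a text prompt into screen
# coordinates, then pyautogui drives the real os cursor
import pyautogui

def locate_ui_element(prompt, screenshot):
    # hypothetical visual + coordinate model call; in practice this would
    # send the screenshot and prompt to the model and parse something
    # like {"x": 200, "y": 900} out of its response
    raise NotImplementedError("plug your model in here")

screenshot = pyautogui.screenshot()              # capture the current screen
x, y = locate_ui_element("where is the new note button", screenshot)
pyautogui.moveTo(x, y, duration=0.2)             # move the cursor there
pyautogui.click()                                # press the button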
still, what is left for a really autonomous agent (a rough loop is sketched below):
- task flow control: how is the task proceeding?
- goal control: what is the goal we want to accomplish? (generally this comes from the human controlling the model, unless we want to build real agi)
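a rough shape of that loop: the goal comes from the human, the visual+coordinate model picks the next action each step, and flow control decides when to stop. the model call is again a hypothetical placeholder:

# sketch of an agent loop, with the same kind of hypothetical model helper as above
import pyautogui

def next_action(goal, screenshot):
    # hypothetical: model looks at the screen and the goal, returns either
    # ("click", x, y), ("type", "some text"), or ("done",)
    raise NotImplementedError

def run_task(goal, max_steps=20):
    for _ in range(max_steps):                  # task flow control: bounded steps
        screenshot = pyautogui.screenshot()
        action = next_action(goal, screenshot)
        if action[0] == "done":                 # model judges the goal is reached
            return True
        if action[0] == "click":
            _, x, y = action
            pyautogui.moveTo(x, y, duration=0.2)
            pyautogui.click()
        elif action[0] == "type":
            pyautogui.write(action[1])
    return False                                # give up after max_steps

run_task("send a new note saying hello")        # goal control: comes from the human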