LIAutomator
More human-like automation
To be specific: a proof-of-concept AI agent that controls my computer and logically traverses web UIs the way a human would, in order to automate nontrivial tasks and workflows in web apps.
Demo
Short-Form Demo (recommended)
Full-Length Realtime Raw Demo
How It Works
- The AI Agent reads through a profile and extracts relevant info using a multimodal LLM.
- Using a research technique called Set-of-Marks prompting, the multimodal LLM chooses which website button to click by identifying the lettered encoding shown next to that button.
- The AI agent then presses the keys of that lettered encoding to trigger the click on the corresponding button. This allows the AI agent to click on buttons without having to move the mouse at all!
- The AI agent then generates, types out, and sends a personalized message based on the extracted profile info.
- Once the message is sent, the AI agent finds a new person and repeats the entire process.
[!NOTE] Notice how this process is similar to what a human would do when accomplishing a task using a GUI.
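The label-to-keypress step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the lettered hints are already visible on the page, that the prompt asks the model to answer in the form `LABEL: <letters>`, and that key presses are sent through a caller-supplied function. The names `parse_label` and `click_by_label` are hypothetical.

```python
import re
from typing import Callable, Iterable


def parse_label(llm_reply: str, valid_labels: Iterable[str]) -> str:
    """Extract the chosen hint label from a reply shaped like 'LABEL: SD'.

    The 'LABEL: <letters>' reply format is an assumption of this sketch.
    """
    valid = {label.upper() for label in valid_labels}
    match = re.search(r"LABEL:\s*([A-Za-z]+)", llm_reply)
    if match is None:
        raise ValueError(f"no label found in reply: {llm_reply!r}")
    label = match.group(1).upper()
    if label not in valid:
        # Guard against the model inventing a label that is not on screen.
        raise ValueError(f"label {label!r} is not one of the visible hints")
    return label


def click_by_label(label: str, press_key: Callable[[str], None]) -> None:
    """Type the label one key at a time; the mouse is never moved."""
    for char in label.lower():
        press_key(char)


# Example with a fake keyboard that just records the keys it receives.
pressed = []
chosen = parse_label("I will click Connect. LABEL: sd", ["SD", "FA", "K"])
click_by_label(chosen, pressed.append)
print(chosen, pressed)
```

To drive a real keyboard, `press_key` could be a function from an input-automation library such as `pyautogui.press`; that choice is also an assumption, since the README does not say which library is used.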
Features
- Works out-of-the-box!
- Includes GUI to start/stop automation and show real-time metrics (like # of LLM requests)
- It’s TOTALLY FREE! $0 in cloud costs for AI APIs (you need a Together API key)
- No hardcoding of GUI element positions! This application runs independently of the application window’s size or location on your computer. See the FAQs for more info.
- In-built Robustness:
  - Error handling and backtracking of states, so that the automation can recover from a failure within 3 tries.
  - Configurable redundancy option that increases the number of vision models used, reducing possible hallucinations (or unnecessary creativity)
- FAST! This application is multi-threaded.
- Audio updates on the current state and the AI agent’s current action (macOS only for now)
- Can be configured to include human input/directions in the automation process (using Whisper transcription of your voice)
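The two robustness ideas in the list above (retrying a step up to 3 times with backtracking, and asking several vision models the same question) could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, a "strict majority" rule is only one possible way to combine model answers, and the real project may handle failures differently.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the answer given by a strict majority of models, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None


def run_with_retries(
    step: Callable[[], bool],
    backtrack: Callable[[], None],
    max_tries: int = 3,
) -> bool:
    """Run `step` until it succeeds, restoring state between failed attempts."""
    for attempt in range(1, max_tries + 1):
        try:
            if step():
                return True
        except Exception:
            # Any exception counts as a failed attempt in this sketch.
            pass
        if attempt < max_tries:
            backtrack()  # return to a known-good state before retrying
    return False


# Example: three models disagree, and the majority answer wins.
print(majority_vote(["SD", "SD", "FA"]))
```

Returning `None` when there is no majority lets the caller treat disagreement as a failed attempt and fall back to the retry loop.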
Features for Developers
- Code is written following OOP principles, allowing for modularity and reuse of components to create new automations
- Framework included to abstract away computer-control functions like scrolling, clearing/inputting text, etc.
- Logging enabled for easier debugging; automation metrics and configuration are collected/saved in a .csv file
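As an illustration of saving automation metrics and configuration to a .csv file, here is a minimal sketch. The column names (`run_id`, `llm_requests`, `retries`, `redundancy`) and the `RunMetrics` class are assumptions made for this example; the project's real schema is not documented here.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import TextIO


@dataclass
class RunMetrics:
    # Hypothetical columns; only "number of LLM requests" is named in this README.
    run_id: str
    llm_requests: int
    retries: int
    redundancy: int


def append_metrics(out: TextIO, metrics: RunMetrics, write_header: bool) -> None:
    """Write one metrics row, preceded by a header row if requested."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(RunMetrics)])
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(metrics))


# An in-memory buffer stands in for the real .csv file.
buffer = io.StringIO()
append_metrics(buffer, RunMetrics("run-1", 12, 1, 2), write_header=True)
append_metrics(buffer, RunMetrics("run-2", 9, 0, 2), write_header=False)
print(buffer.getvalue())
```

When writing to an actual file, open it with `newline=""` as the `csv` module documentation recommends, so that row endings are not translated twice.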
Future
- Open-source the source code if there seems to be interest!
- Benchmark against Claude Computer Use and ChromeGPT
- Make it faster
- Run large-scale experiments
- Maybe make a Docker image?
- Experiment with other models (perhaps finetuned models), better prompt engineering, better hyperparameter tuning
- Create a higher level of abstraction, or harness (a set of functions with simple APIs), so the AI can generate its own actions from a high-level task description. That way, you don’t need to preconfigure the workflow (the AI tries to learn the proper workflow).
  - Has some overtones of Reinforcement Learning, tbh
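One way such a harness might be shaped is a small registry of named, application-independent actions that an AI-generated plan is validated against and then executed. This is a speculative sketch of a future feature, not existing code: `ActionRegistry` and the plan format are hypothetical, and the action bodies are placeholders that return strings instead of controlling the computer. The action names `read`, `scroll`, and `find_click_button` are the ones mentioned in the FAQ.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[str], str]


class ActionRegistry:
    """Maps simple action names to functions an AI-generated plan may call."""

    def __init__(self) -> None:
        self._actions: Dict[str, Action] = {}

    def register(self, name: str) -> Callable[[Action], Action]:
        def decorator(fn: Action) -> Action:
            self._actions[name] = fn
            return fn
        return decorator

    def describe(self) -> List[str]:
        """Action names that could be listed in the LLM prompt."""
        return sorted(self._actions)

    def run_plan(self, plan: List[Tuple[str, str]]) -> List[str]:
        """Execute (action_name, argument) pairs, rejecting unknown actions."""
        results = []
        for name, argument in plan:
            if name not in self._actions:
                raise KeyError(f"unknown action: {name}")
            results.append(self._actions[name](argument))
        return results


registry = ActionRegistry()


@registry.register("read")
def read(target: str) -> str:
    return f"read {target}"  # placeholder for screenshot + LLM extraction


@registry.register("scroll")
def scroll(direction: str) -> str:
    return f"scrolled {direction}"  # placeholder for a real scroll


@registry.register("find_click_button")
def find_click_button(description: str) -> str:
    return f"clicked {description}"  # placeholder for label lookup + keypress


# A plan like this would come from the LLM given a high-level task description.
plan = [("read", "profile"), ("scroll", "down"), ("find_click_button", "Connect")]
print(registry.run_plan(plan))
```

Rejecting unknown action names keeps a hallucinated step from silently doing nothing, and the same registry can be reused across websites because none of the actions are application-specific.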
Lessons Learned
- Should have written everything in an OOP style from the beginning, and planned more
- Will update more later
FAQ
- Why not just use Selenium since this is just browser automation?
  - Good question! First, Selenium is source-code-dependent automation. You need to read the underlying HTML of the webpage, understand how to grab the elements you want, and then code it up. Thus, Selenium is not really accessible to non-coders, and even if you can code, it can be too “in-the-weeds”. Second, the whole purpose of websites, browsers, and GUIs is to help humans interact naturally with software. So, let’s automate GUIs in the way they were meant to be used. Third, Selenium is not scalable. You can’t really create high-level primitives and then ask an AI to use those primitives to create a list of actions that executes a task. I mean, maybe you can, but the point is that you want the possible GUI-based automation actions (read, scroll, find_click_button) to be as generalizable as possible (i.e., not application-specific). This allows for a wider variety of automation applications while still using the same GUI-based automation actions (platform-independent, or in this case, website-independent). Overall, Selenium is Browser Automation, while this AI Agent is Browser Automation on Steroids.
- Why not just use a LinkedIn API to read profile info and send connection requests? You should be able to do this even with anti-bot backend mechanisms, since you’re not using a browser.
  - Please see the question above. But to answer briefly, there might be unnecessary problems: rate limits, anti-bot detection mechanisms since you’re not using a browser, perhaps too much time spent reverse-engineering their API, no GUI/visual confirmations, not scalable, not understandable/accessible to non-programmers, etc.