By David Stephen
Sedona, Az –AI tools—offering a range of audio, video, image and text services—indexed across search engines could have their tokens scraped daily, to extract the provenance for some of their current misuses.
Several AI models store prompts, others store conversation history, used for further fine-tuning. They provide a traceable opportunity to find sources of deepfakes and misinformation. They also provide a chance to have AI tools take better precautions.
In a recent feature by the Associated Press, Tests find AI tools readily create election lies from the voices of well-known political leaders, discussing a report, “Researchers used the online analytics tool Semrush to identify the six publicly available AI voice-cloning tools with the most monthly organic web traffic: ElevenLabs, Speechify, PlayHT, Descript, Invideo AI and Veed. Next, they submitted real audio clips of the politicians speaking. They prompted the tools to impersonate the politicians’ voices making five baseless statements. Some of the tools — Descript, Invideo AI and Veed — require users to upload a unique audio sample before cloning a voice, a safeguard to prevent people from cloning a voice that isn’t their own. Yet the researchers found that barrier could be easily circumvented by generating a unique sample using a different AI voice cloning tool. In a total of 240 tests, the tools generated convincing voice clones in 193 cases, or 80% of the time. None of the AI voice cloning tools had sufficient safety measures to prevent the cloning of politicians’ voices or the production of election disinformation.”
Prompts and answers on these AI tools and others—interested in remaining indexed by major search engines—may at least have an equivalent of a robot.txt file or a data API, so as to organize conversation tokens and embeddings for availability, to ensure that wherever whatever is coming from can be found, as well as noting what was done within a day and how to track or expect how it might be misused. Some of the open source Large Language Models [LLMs] where many of the tools are built can have their base or assistant models allow, say __getitem__(token: info), so that harm from voice cloning or impersonation, as well as generated videos, texts and images, can be traced and detained.
There are several approaches to AI safety and alignment, but search engines have some leverage in getting data from crawled AI tools
Recently, the UK AI safety institute announced a new AI safety evaluations platform, stating that “Inspect is a software library which enables testers – from start ups, academia and AI developers to international governments – to assess specific capabilities of individual models and then produce a score based on their results. Inspect can be used to evaluate models in a range of areas, including their core knowledge, ability to reason, and autonomous capabilities.”
The National Institute of Standards and Technology (NIST) also announced “a new testing, evaluation, validation and verification (TEVV) program intended to help improve understanding of artificial intelligence’s capabilities and impacts. Assessing Risks and Impacts of AI (ARIA) aims to help organizations and individuals determine whether a given AI technology will be valid, reliable, safe, secure, private and fair once deployed. ARIA will help assess those risks and impacts by developing a new set of methodologies and metrics for quantifying how well a system maintains safe functionality within societal contexts.”
These efforts, which appear to be voluntary, can be used to go after AI tools—and their outputs—wherever they are available. There have been several reports of troubling impersonations that were costly to families, in which some of the tools were prompted by keywords that can easily be extracted (using, say query vectors), including for those that split their methods, doing some in one tool and others in the next.
Already, several AI tools have their own safety efforts, some are halting and reporting misuses, but this may not be enough, as new methods are found all the time. Jailbreaks and prompt injection attacks are often explored. Some only get blocked when made public, or after usage for unwanted stuff. It could be necessary to scrape AI tools and stack paths, beginning with those that want to remain found on search engines. With a database of embeddings, it is possible to use vector dimensions to seek outputs that came from non-indexed pages as well.
The research may begin with how to organize this standard, within jurisdictions or beyond, especially by extended approaches while crosslinking with other sources, using deepfakes as the benchmark. The European Union has an AI Regulation Act. Colorado has an AI regulation law. The Senate has a roadmap for AI bills. Some are cautioning against over-regulation.
There is a recent letter, A Right to Warn about Advanced Artificial Intelligence, stating that, “AI companies possess substantial non-public information about the capabilities and limitations of their systems, the adequacy of their protective measures, and the risk levels of different kinds of harm. So long as there is no effective government oversight of these corporations, current and former employees are among the few people who can hold them accountable to the public.”
In many cases, it may not be possible to get exact tokens, but summaries could be provided, or averages, especially along keywords and parsed. Data from these sources may also be used to explore how to develop matrix rows for intent, against misuse.
This may apply to AI tools on the app, play and other stores. Some internet service providers may allow for ways to have this running. Hosting services for domains too, to ensure that they can be scraped, or that outputs seen, or to be downloaded are not misleading. It may also apply to internet forums and social media, going beyond AI agents or bots product, by AI firms.
The tech may find usefulness in fighting hallucination-confabulation, for tools that agree on collaboration within the standard. This approach can seek precision for sources, while allowing freedom for tools against emerging risks, without stifling innovation or excluding the human mind.