LatAmGPT Aims To Create AI That Better Represents The Region's Diversity

Latin America has been the cradle of now globally popular literary and musical genres, staple foods like the potato and the inspiration behind the well-known Happy Meal. It could also become the cradle of a new form of AI.

A coalition of research institutions is working on what they call LatAmGPT — a tool that can take into account the region’s language variances, cultural experiences and “idiosyncrasies.”

The aim is to offer users a more faithful view and representation of the Americas and the Caribbean than that offered by large language models (LLMs), most of which have come from U.S. or Chinese companies and were largely trained in English.

“We want to develop our capabilities, find local AI-based solutions and create a better understanding of these tools in Latin America and about Latin America,” said Rodrigo Durán Rojas, director of Chile’s National Center for Artificial Intelligence, which is coordinating the effort.

Durán Rojas said that for general purposes, the project will be hard pressed to compete with “state of the art models with multimillion budgets,” but that “what our model can offer that others don’t is a much richer and representative outlook of Latin America and the Caribbean,” its people and its outputs.

For example, Durán Rojas said initial testing has shown LatAmGPT to have far better results when queried about South American history, and that the same is expected when the LLM is asked to, say, write a poem in the style of local authors or provide an overview of regional education policy.

More than 30 institutions from countries across the hemisphere are involved in developing LatAmGPT, and collaborators include Latinos in the U.S. such as Freddy Vilches Meneses, an associate professor of Hispanic studies at Lewis & Clark College in Oregon. This, he said, is in recognition of how “Latino and Latin American experiences are a cultural fellowship that goes beyond geography.”

“There are elements of Latin America in Oregon, in California, in Texas,” Vilches Meneses said. “We want to make sure to incorporate that Latino experience as well.”

LatAmGPT, which aims to launch its first publicly available version around June, was announced last month on the heels of a regional commitment made during a summit on artificial intelligence in Uruguay to focus on “ethical, inclusive and beneficial” technological development to “promote and protect human rights” and explore the best possible public policies for AI governance.

That impulse follows an increasing uptake in the region of technological advances such as the use of drones to monitor deforestation in the Amazon rainforest, the development of apps to encourage more people to continue learning Indigenous languages, the creation of algorithms to aid in the search for forcibly disappeared people or the adoption of blockchain mechanisms to preserve historical documents of past dictatorships’ actions.

Some of those preserved documents are now being used as sources to train LatAmGPT, along with papers, records and logs that institutions such as libraries and national archives have made available specifically for the project. Durán Rojas said this gives the model more nuance and localized breadth than the general internet data scraping other systems tend to use.

“LatAmGPT will have more context than the other language models and should therefore hallucinate far less” when it comes to its use cases, Durán Rojas said. Hallucination is the term AI researchers use for when a model seemingly makes up an answer that is incorrect or false but presents it as factual.

So far the project’s dataset holds more than 8 terabytes of information, which supports a model of about 55 billion parameters (the variables an LLM uses to produce a prediction, loosely comparable to the connections between neurons in a human brain). Durán Rojas said that’s somewhat close to what the first public version of ChatGPT had when OpenAI launched it in the fall of 2022.

The challenges of diverse dialects and complex grammar

ChatGPT and other models like Google’s Gemini have also sought in recent years to include a wider scope of data to offer the programs in languages other than English and with “localizations,” such as the LLM knowing to respond in the metric system when relevant or to understand idioms.

Those companies acknowledge the importance of expanding that offering. HyunJeong Choe, the director of engineering and internationalization for Google’s Gemini Apps, said it’s “a dedicated experience” that can be “essential for cultural relevancy and sensitivity.”

But they also recognize it’s a particularly complex endeavor, since most training data available to them is in English. “The intricacies of different languages can pose a significant obstacle for all AI models. … Languages with complex grammar, diverse dialects or limited digital resources may be harder to train,” Choe said.

LatAmGPT, through its institutional networks with libraries and archives, has somewhat skirted this issue — but not entirely. Durán Rojas said they’re still struggling to incorporate Indigenous languages spoken by millions in the region because written documentation is not as widely available.

They still intend to try as they continue to refine their model, though they stress the importance of collaboration.

“The quality and attributes of the results we can get will depend on us as Latin Americans joining in to contribute as much as we can,” said Vilches Meneses, the Lewis & Clark professor.

With a tentative June launch date, LatAmGPT is still receiving data, and collaborators regularly test it with specific questions to benchmark it against other available models.

Among the questions they’re testing are queries on the many different names and terms used in the region for a single concept like “car,” or a request for the model to make a comparison chart of how the region’s countries have responded to mass immigration from places like Venezuela.
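For readers curious what this kind of side-by-side testing can look like in practice, here is a minimal sketch in Python. It is not the project's actual benchmark: the question list, the scoring rule (what share of expected regional terms an answer mentions) and the two stand-in "models" are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical benchmark item: a prompt plus regional terms a good answer should mention.
QUESTIONS = [
    {
        "prompt": "What words are used for 'car' across Latin America?",
        "expected_terms": ["carro", "auto", "coche"],
    },
]


def coverage(answer: str, expected_terms: List[str]) -> float:
    """Return the fraction of expected terms found in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)


def benchmark(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Average the coverage score of each model over all benchmark questions."""
    scores = {}
    for name, ask in models.items():
        total = sum(coverage(ask(q["prompt"]), q["expected_terms"]) for q in QUESTIONS)
        scores[name] = total / len(QUESTIONS)
    return scores


# Stand-ins for real models: plain functions that return canned answers.
def regional_stub(prompt: str) -> str:
    return "People say carro, auto or coche depending on the country."


def generic_stub(prompt: str) -> str:
    return "The usual word is coche."


scores = benchmark({"regional_stub": regional_stub, "generic_stub": generic_stub})
print(scores)
```

In a real comparison, the stub functions would be replaced by calls to the models being evaluated, and simple term matching would likely give way to review by people who know the region.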

A major goal of LatAmGPT is to build familiarity with these technological advances so they can be reflected in public policies and regulations, according to Durán Rojas.

For that, the transcontinental network created to help develop the project is key and, according to Durán Rojas, will likely remain so.

“The most meaningful aspect, the greatest legacy, is this interconnectedness we’ve found to strengthen and develop AI-based solutions,” he said. “The model, I mean it’s great that we’re making it, but the collaboration, that’s what will most impact how we build things going forward.”

And with that there is a growing opportunity to offer further contributions with a Latino touch.

“At its base, this is jointly creating something from Latin America for Latin America and for the world, as proof to ourselves and to others that we can also produce high tech,” Vilches Meneses said, “and that we can contribute to knowledge of artificial intelligence while still employing our social and cultural intelligence.”

An earlier version of this story was first published by Noticias Telemundo.
