It’s no secret that data is the lifeblood of AI models. However, the data used to train models is often treated as an afterthought to the model’s framework. That can no longer be the case as AI becomes more pervasive and influential in consequential decision-making.

The importance of quality data is illustrated by the valuation of data labeling start-up Scale AI, which recently reached $7.3bn and has more than doubled in the space of just over four months. Scale AI’s purpose is to streamline and – as its name suggests – scale the often data-hungry machine learning (ML) development cycle for its customers by means of a platform providing labeled and annotated training data.

The company’s CEO Alexandr Wang often uses the line ‘data is the new code’ to encapsulate its value proposition: helping customers use and manipulate data just as they have done with code in the past. It’s no surprise, therefore, that this paradigm shift brings with it increased scrutiny of the data used to train AI models.

Data cannot simply be a means to a modeling end

Garbage in, garbage out is a longstanding modeling principle. However, with the advent of modern AI and, concurrently, the massive amounts of information for ML models to process, it is harder to discern the logic behind poor outputs. Using publicly generated datasets to train algorithms may result in ML models replicating biases that are inherent in the data itself. It is not the model frameworks but the data used to train them that is coming under greater scrutiny.

There has been no shortage of controversial stories in recent months which have focused on AI bias. In the realm of large language modeling, OpenAI’s GPT-3 has been observed to exhibit racial bias. Similarly, Twitter’s image recognition model showed bias when cropping images for preview purposes. Perhaps the most damaging (and ongoing) issue is Google’s dismissal of a high-profile AI ethics researcher who questioned the purpose of and lack of accountability behind the kinds of large language models that Google develops and deploys.

What seems like a revolving door of issues has brought AI ethics firmly into the public domain, and the public appears to be growing wary of black box-like, ‘end-to-end’ uses of AI. Another reason for the increase in public consciousness of AI is that the technology is starting to have a real impact in areas such as recruitment and consumer finance, which are more tangible than further-off developments like autonomous vehicles (AVs).

Explainable AI is an important but fundamentally reactive tool for responsible AI

Explainable AI (XAI), whose purpose is to allow humans to understand and follow the path a model took to make a decision, is becoming a trend in situations where important yet frequent decisions are automated. Though this may help build trust with consumers, XAI is an essentially reactive approach to modeling, lying at the end of or outside the ML development cycle.

Outsourcing to Scale AI or the growing number of start-ups offering similar services could be a way to improve the clarity of the model development process. Scale AI’s Nucleus platform, for example, provides means of enabling greater transparency in the ML development cycle, such as training data slicing and model debugging capabilities. Other start-ups like AI.Reverie are concentrating on synthetic generation of data using augmentation techniques, with an overt focus on improving algorithm generalization and thereby reducing bias.
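To make the idea of data slicing concrete, here is a minimal sketch in plain Python. It does not use or represent the API of Nucleus or any other product; the function name `accuracy_by_slice`, the record fields and the toy records are all hypothetical, chosen only for illustration. The point is that a single overall accuracy figure can hide a group on which the model performs much worse.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Return {slice value: accuracy} for a list of prediction records.

    Each record is a dict holding a true "label", a model "prediction"
    and the attribute named by slice_key (all hypothetical field names).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[slice_key]
        totals[group] += 1
        if record["label"] == record["prediction"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Made-up toy records, not real data.
records = [
    {"region": "A", "label": 1, "prediction": 1},
    {"region": "A", "label": 0, "prediction": 0},
    {"region": "A", "label": 1, "prediction": 1},
    {"region": "B", "label": 1, "prediction": 0},
    {"region": "B", "label": 0, "prediction": 0},
]

per_slice = accuracy_by_slice(records, "region")
print(per_slice)  # {'A': 1.0, 'B': 0.5}
```

Overall accuracy on these toy records is 80%, yet the slice for region B sits at 50%. A gap of that kind is the sort of signal that would prompt a closer look at how well that group is represented in the training data.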

The underlying message that AI companies need to heed is an obvious one, but one that can’t be overstated: the data used to train model frameworks is just as significant as the frameworks themselves. Companies must also think very carefully about the purpose and scope of their models. Data cannot be a secondary consideration in the development and deployment of AI.