Data cannot be a realm of dominance

AI needs data to learn, grow and deliver. It is data that breathes life into AI models and determines how they will make decisions when faced with different situations. Therefore, whoever owns the dataset gets to decide how the models will perform and what output they will produce. Since getting hold of clean and useful data is difficult and requires monetary investment, it is often the big firms and their partners that hold authority over the data used to train their products. The exceptional volumes of data required to train generative AI models are not an easy resource to find or manage. Time, money and a great deal of labour go into building datasets that AI companies can rely on to build their products. The incumbent big fish in the ocean of the technology industry are therefore seeing their dominance entrenched, tilting the advantage ever further in favour of the large companies.

How do large firms benefit disproportionately?

Colossuses of the technology industry have the deep pockets required to acquire and manage the vast troves of data that help develop, train and calibrate next-generation AI models. Deep pockets also matter when it comes to fending off lawsuits over unethical use of AI models and indemnifying the users of their products in case of intellectual property claims. The anticompetitive practices these companies employ add further to the concerns of new entrants. Large companies have already scraped humongous volumes of data and created proprietary datasets, which small emergent companies do not readily have access to.

Acquiring such large datasets from large companies for training new products, or building them from scratch, takes time, money and labour, which again raises the barriers to entry for startups in the AI industry.

Yes, it is true that simply scraping internet data, some of which happens to be under copyright or a Creative Commons license, is unethical, and lawsuits have been filed against big firms involved in such activities. Yet many of these lawsuits are still awaiting verdicts, while big firms cement their monopoly over data, making it increasingly difficult for small companies to enter the fray.

Will meaningful data grow scarcer?

It seems so. With the emergence and prevalence of generative AI tools, the proportion of synthetic data is increasing worldwide. Human-generated text is already being overshadowed by the enormous amount of synthetically generated data, which makes meaningful human-generated data scarcer.

What is more unsettling is that several websites have put paywalls in front of well-drafted content and deployed blockers against web crawlers, making it more challenging for small entities to acquire high-quality data online.
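
To make the crawler-blocker point concrete, here is a minimal sketch, assuming Python and a purely hypothetical site and crawler name, of how a small scraper would check a site's robots.txt before fetching a page; sites that disallow crawling at this step cut off exactly the kind of data collection described above.

```python
# Minimal sketch: checking a site's robots.txt before crawling it.
# The URL and crawler name below are hypothetical placeholders.
from urllib import robotparser

SITE = "https://example.com"                      # hypothetical site
PAGE = SITE + "/articles/some-article"            # hypothetical page to fetch
USER_AGENT = "SmallStartupBot"                    # hypothetical crawler name

def is_crawl_allowed(page_url: str, user_agent: str) -> bool:
    """Download the site's robots.txt and report whether this URL may be fetched."""
    parser = robotparser.RobotFileParser()
    parser.set_url(SITE + "/robots.txt")
    parser.read()                                 # fetches and parses robots.txt
    return parser.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    if is_crawl_allowed(PAGE, USER_AGENT):
        print("robots.txt permits crawling this page")
    else:
        print("robots.txt blocks this crawler from this page")
```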

Since the performance of AI models is a function of data quality, the decline in human-generated content will limit access to quality data, leading to degradation of AI models. Unless new AI models are fed with high-quality data, their training will remain subpar, severely affecting the efficacy of products built on such models.
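
As a rough illustration of what feeding models high-quality data can look like in practice, the toy sketch below applies two naive heuristics, a minimum length and exact-duplicate removal, to a corpus before training; the thresholds and sample documents are assumptions for illustration only.

```python
# Toy sketch: naive quality filtering of a text corpus before training.
# Thresholds and the sample documents are illustrative assumptions only.

def filter_corpus(docs, min_words=50):
    """Keep documents that are long enough and not exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text.split()) < min_words:
            continue        # drop very short, low-information snippets
        if text in seen:
            continue        # drop exact duplicates
        seen.add(text)
        kept.append(text)
    return kept

if __name__ == "__main__":
    article = "A longer, carefully written article about data markets. " * 20
    corpus = ["Short snippet.", article, article]   # one short doc, one duplicate
    print(f"{len(filter_corpus(corpus))} of {len(corpus)} documents kept")
```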

Will this lead to a monopoly?

Since the large incumbents in the AI space sit on ginormous treasures of high-quality data, their products are bound to be better trained and optimized. This can create a favourable perception of their products in the minds of users, who find the offerings of smaller companies ineffective due to the subpar training of their AI models. As the data monopoly entrenches further, the quality gap between the products of large firms and small entities may widen, to the point where large companies hold disproportionate market dominance at the cost of the profitability of small firms, eventually chasing them out of business.

With the decline in publicly available data, companies are realizing the need to build proprietary datasets for training their AI models.

Building meaningful training data requires gigantic computing capacities, resources for scraping and labelling data, and access to scarce computing chips, which places large companies with big budgets in a favourable position. Moreover, the better a company's platform is, the stronger the network effects on that platform, which brings even greater volumes of data to the incumbent. For example, Google owns YouTube, one of the largest reservoirs of video data, while Instagram has a perpetual flow of imagery and X has oceans upon oceans of text data. Competing against such leviathans of data is nearly unimaginable, if not impossible. Microsoft, Google, Meta and X hold unthinkable amounts of data that could be replicated only with extraordinary resources.

How are the big firms behaving?

Large firms are growing progressively protective of their data troves. For instance, Microsoft faced a lawsuit from X for alleged illegal use of X's data. YouTube recently warned Anthropic, Apple, Nvidia and OpenAI against unauthorized use of its videos and transcripts for training AI models. The European Union (EU) has passed data privacy laws that restrict large companies from scraping users' content in its region, yet these laws are limited to the EU's jurisdiction and do not necessarily apply retroactively.

When partnering with third-party entities for the latter's proprietary data, large companies sign deals with clauses on cloud exclusivity, integrations and API credits, as Courtney Radsch has noted for Tech Policy Press, terms that small companies are unable to offer, which limits their bargaining power. With access to both proprietary and third-party data, large companies easily sideline small companies in the race to acquire meaningful data.

Can synthetic data help anyhow?

Perhaps it can, depending on the quality of the synthetic data. Experts believe that synthetic data can mitigate the shortage of training data and protect the privacy of users. Self-improvement approaches in reinforcement learning can help AI models generate synthetic data and refine their own outputs without the need for human feedback on the results.
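
A minimal sketch of such a self-improvement loop is shown below, with a toy stand-in for the model and a programmatic scoring function in place of human feedback; every name in it is a hypothetical placeholder rather than a real model or library API.

```python
# Toy sketch of a self-improvement loop: a stand-in "model" generates answers,
# a programmatic scorer (instead of human feedback) filters them, and the
# surviving pairs become synthetic training data for the next round.
import random

def toy_model(prompt, knowledge):
    """Stand-in generator: answers from what it has 'learned', else guesses."""
    return knowledge.get(prompt, random.choice(["42", "Paris", "unknown"]))

def automatic_score(prompt, answer):
    """Stand-in reward signal, e.g. a verifier or unit test, not a human rater."""
    return 0.0 if answer == "unknown" else 1.0

def self_improve(prompts, rounds=3):
    knowledge = {}                                      # plays the role of model weights
    for _ in range(rounds):
        synthetic_pairs = []
        for prompt in prompts:
            answer = toy_model(prompt, knowledge)
            if automatic_score(prompt, answer) > 0.5:   # keep only high-scoring outputs
                synthetic_pairs.append((prompt, answer))
        for prompt, answer in synthetic_pairs:          # "fine-tune" on the kept pairs
            knowledge[prompt] = answer
    return knowledge

if __name__ == "__main__":
    print(self_improve(["What is 6 x 7?", "Capital of France?"]))
```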

What policies can help small companies tomorrow?

Policymakers need to be on their toes to spot any kind of anticompetitive practice in the technology industry and mitigate the risks posed to small entities by the large incumbents. Any privacy violation or attempt to establish dominance must be countered with competition law. Companies that violate antitrust laws or use uncompensated copyrighted materials to gain undue market share must be probed, and data supply chains must be monitored by policymakers at all times. Finally, any jurisdiction lacking data privacy laws should take cues from the General Data Protection Regulation (GDPR) and create a business landscape that is inclusive of small entrants and large incumbents alike, regardless of their data prowess.
