It will be very interesting to see how the legal questions play out over Google's use of the contents of 15·1 million websites for its C4 dataset, which is used to train large language models. Ton Zijlstra put me on to a Washington Post article that revealed which sites were used. He had discovered that his own website (zylstra.org) had provided […]