Could we create a "virgin" LLM?
One that isn't trained on any "stolen" work and specific to fiction novels?
A lot of writers are angry at LLMs because “they have stolen fellow writers’ work” (and probably their own too).
One solution to that would be to create a completely new LLM, separate from Claude and ChatGPT and Mistral, etc., trained from scratch only on the works of authors who have opted in, who have consented, and who have purposefully handed over their work, and who, with any luck, would get paid for it.
That’s what I set out to explore a couple months ago. I wanted to build it and heal both sides of the “AI” debate.
I was a little hell-bent on this idea and I disregarded some advice that it wouldn’t be worth it. I was going to make it happen, I told myself.
Well, I’m here to tell you today that it ain’t happening, sorry.
The sheer amount of data and the cost to train it are the two limiting factors. Let me explain.
I’ve talked to a couple experts and I’ve used some tools to estimate how many books we would need, and it could be in the hundreds of thousands of books range.
It’s possible that it’s only in the tens of thousands of books range, but even at that scale … wow, that’s just a lot of people to convince that this is a good idea and to hand over their work. It would take a whole separate team and business just to manage that.
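If you’re curious where a number like that comes from, here’s the kind of back-of-the-envelope math involved. The words-per-novel figure, the “tokens per parameter” rule of thumb (roughly 20, from the Chinchilla scaling work), and the model sizes are all assumptions I’m plugging in for illustration, not hard facts:

```python
# Rough back-of-the-envelope estimate of how many novels a from-scratch model
# might need. Every input here is an assumption for illustration only.

AVG_WORDS_PER_NOVEL = 90_000   # a typical full-length novel
TOKENS_PER_WORD = 1.3          # rough tokenizer overhead for English prose
TOKENS_PER_PARAM = 20          # "Chinchilla" rule of thumb: ~20 training tokens per parameter

tokens_per_novel = AVG_WORDS_PER_NOVEL * TOKENS_PER_WORD   # ~117k tokens

for params in (1e9, 3e9, 7e9):  # hypothetical model sizes: 1B, 3B, 7B parameters
    tokens_needed = params * TOKENS_PER_PARAM
    novels_needed = tokens_needed / tokens_per_novel
    print(f"{params / 1e9:.0f}B params -> ~{tokens_needed / 1e9:.0f}B tokens"
          f" -> ~{novels_needed:,.0f} novels")

# Output (approximately):
#   1B params -> ~20B tokens -> ~170,940 novels
#   3B params -> ~60B tokens -> ~512,821 novels
#   7B params -> ~140B tokens -> ~1,196,581 novels
```

Even for the smallest hypothetical model, that’s hundreds of thousands of novels.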
And then there’s the cost.
The GPU time that’s needed to train a brand-new model from the ground up is on the order of $150k-200k, and that’s a conservative, low-end estimate 🤯
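For the curious, here’s roughly how that kind of estimate gets made: the common heuristic is that training takes about 6 × (parameters) × (tokens) floating-point operations; divide that by what a GPU can realistically sustain and multiply by the hourly rental price. The model size, token count, utilization, and price below are my own illustrative assumptions (loosely in the neighborhood of a Llama-1-7B-scale run), not a quote from anyone:

```python
# Very rough training-cost estimator using the common 6 * N * D FLOPs heuristic.
# Every number here is an assumption for illustration; real budgets also include
# failed runs, experimentation, data processing, and engineering time.

def training_cost_usd(params, tokens, gpu_peak_flops=312e12, mfu=0.4, usd_per_gpu_hour=1.80):
    """Estimate the raw GPU cost of a single training run.

    params           -- model parameter count
    tokens           -- training tokens seen
    gpu_peak_flops   -- peak bf16 throughput of one GPU (A100-class assumed)
    mfu              -- assumed fraction of peak throughput actually achieved
    usd_per_gpu_hour -- assumed cloud rental price per GPU-hour
    """
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (gpu_peak_flops * mfu)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# Hypothetical 7B-parameter model trained on ~1 trillion tokens:
hours, cost = training_cost_usd(params=7e9, tokens=1e12)
print(f"~{hours:,.0f} GPU-hours, ~${cost:,.0f}")   # on the order of 100k GPU-hours, low six figures
```

And that’s just the raw compute for one clean run; the experiments, failed runs, and data work along the way only push it higher.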
That gives me a whole new appreciation for what Anthropic, OpenAI, etc have built and puts them in a totally different league from what I could do.
I definitely have neither the money nor the influence to get all the data that would be needed.
So I have to give up this project idea.
It’s sad.
I thought it had so much potential.
But ya know, I’ve made peace with it.
I don’t think that the LLMs available today—and this might make some people angry—are the enemy that we should fight, or even the competitor that we should try to beat.
For one thing, the courts have deemed how they used people’s works to be “fair use.”
And secondly, a close writer friend of mine had his work truly stolen by some pirates a couple of weeks ago, and because of the way it was handled with Amazon, that whole pen name is completely shut down, cut off, dead.
It was a significant chunk of his writing income … and in a few hours it evaporated into thin, Amazon air.
So Amazon is the real enemy … no just kidding!! 😁
My point is that there are people who truly steal authors’ work, but I don’t think the LLMs are among them. And even though part of me still thinks authors should be paid for how their work trains LLMs, that’s not going to happen for now, and I just have to live with that.


There are things within our power to change and things beyond it. I appreciate your effort to explore potential solutions in this field. Who knows what the future holds? Perhaps cheaper hardware will make this feasible one day.
For now, we must make the best choices with what's in front of us and continue doing what we can. This means adapting to the current landscape, supporting each other as a community, and finding new ways to thrive despite the challenges. No matter what side of the AI/GI fence you're on, it’s about resilience and innovation, and I believe writers are a resilient bunch.
I agree with your and Troy's comments, and the LLMs are not doing anything that humans have not done when learning throughout history. The LLMs just learn and are trained much faster than we have been previously. And the LLMs are not pirating books and trying to sell them the way human pirates do.