About Motivatron 5000
Motivatron 5000 is an AI project made by Jakob Anderson. It sprang forth as a variant of his fiction generation project, Eclectic Beams.
Idea
I saw motivational text art on Pinterest, you know the ones. I wondered if I could automate their creation, using my fiction generation workflow from Eclectic Beams as a starting point. First, I'd have to generate text phrases in their style.
Process
Dataset
I hand-typed and scraped around 1,000 motivational phrases and proverbs from these Pinterest motivational quote images and other sources across the interwebs, and put them into a text file, one phrase per line. I then wrapped each line in a prefix and suffix delimiter, so I could tell my generator where to start and stop with these shorter phrases, instead of producing the open-ended text that the fiction generation required.
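As a rough illustration of the delimiter step, here is a minimal Python sketch. The delimiter strings below are an assumption (they follow a common GPT-2 fine-tuning convention); the actual tokens used for this project are not stated, and the function names are hypothetical.

```python
# Hypothetical delimiter tokens; the actual ones used are not stated above.
PREFIX = "<|startoftext|>"
SUFFIX = "<|endoftext|>"


def wrap_phrases(lines, prefix=PREFIX, suffix=SUFFIX):
    """Strip each line, drop blank ones, and wrap the rest in delimiters."""
    wrapped = []
    for line in lines:
        phrase = line.strip()
        if phrase:
            wrapped.append(f"{prefix}{phrase}{suffix}")
    return wrapped


def extract_phrase(generated, prefix=PREFIX, suffix=SUFFIX):
    """Return the first delimited phrase in generated text, or None."""
    start = generated.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = generated.find(suffix, start)
    if end == -1:
        return None
    return generated[start:end].strip()


if __name__ == "__main__":
    sample = ["Dream big.", "", "  Fall seven times, stand up eight.  "]
    for row in wrap_phrases(sample):
        print(row)
```

The same pair of tokens does double duty: it marks phrase boundaries in the training file, and it lets the generator's raw output be trimmed back down to a single short phrase.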
Training the model
I fed that dataset into a Google Colab GPU notebook to fine-tune a GPT-2 124M model on the phrases. I trained it for a week, off and on. It gave me some good sample output after around 10,000 iterations, so I started training the larger, more complex GPT-2 774M model. Over about a month, I got it up to 200,000 iterations, and it was making pretty good sample outputs at default temperature settings. Its average loss had reached 0.01 by around 90,000 iterations, but I badly needed to rewrite my plagiarism filter for this and the rest of my text generation, so I kept it training every day while I did that.
Generating text
I have a Python script that picks random settings, within a safe range, for the generator's temperature, top_k, and top_p params. I made a simple bash script that ran this script in a loop, hundreds of times per day, to generate samples. I examined these samples, found some tighter ranges for these params to randomize within, and continued. Meanwhile, I finished my rewritten plagiarism filter.
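The random-settings idea could look something like the sketch below. The ranges are placeholders chosen for illustration, not the ones actually used, and `random_settings` is a hypothetical name.

```python
import random

# Hypothetical "safe" ranges; the real bounds are not given above.
RANGES = {
    "temperature": (0.6, 1.2),
    "top_k": (20, 80),
    "top_p": (0.8, 1.0),
}


def random_settings(ranges=RANGES, rng=random):
    """Draw one random combination of sampling params from the given ranges."""
    low_t, high_t = ranges["temperature"]
    low_k, high_k = ranges["top_k"]
    low_p, high_p = ranges["top_p"]
    return {
        "temperature": round(rng.uniform(low_t, high_t), 2),
        "top_k": rng.randint(low_k, high_k),  # inclusive on both ends
        "top_p": round(rng.uniform(low_p, high_p), 2),
    }


if __name__ == "__main__":
    # Each run of the script prints one fresh set of params.
    print(random_settings())
```

Tightening the search after reviewing samples then only means editing the bounds in `RANGES`. Passing a seeded `random.Random` as `rng` makes a run reproducible.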
Profanity filtering
Because my fine-tuned models, and GPT-2 itself, are black boxes, I've had some issues with them generating explicit text, which I'd rather not deal with. So I put a simple ML profanity filter model at the end of the generator to weed out unsightly phrasing. Perhaps later I'll just flag these for a special mode, so the grownups can laugh at them, away from any child's eyes.
Plagiarism filtering
The odds of this model generating phrases that were already in the dataset were pretty high at these short text lengths. About 90% of phrases were exact copies at a temperature of 0.7, and it was not much better at a higher, more insane temperature. Even when the model generated new and intriguing phrases, they always felt like copies, so I needed an automated way to check, since pasting each one into "Find" in my editor against the dataset files was very dull and time-consuming. I built a Rust framework around frizensami's plagiarism-basic library, so I could run plagiarism detection quickly in memory from Rust code, instead of passing a directory to his CLI. It uses sliding windows of configurable lengths. I filter my shorter texts with a window length of 3-5 words, and my longer texts with a window length of 5-10 words.
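To show the general sliding-window idea, here is a small Python sketch. It is not the plagiarism-basic API or its exact algorithm, just an assumed simplification: every run of N consecutive words in the dataset goes into a set, and a generated phrase is flagged if any of its own N-word runs appears there. All names are hypothetical.

```python
import re


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def windows(text, size):
    """Return the set of every run of `size` consecutive words in text."""
    tokens = words(text)
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def build_index(dataset_phrases, size):
    """Collect the word windows of every dataset phrase into one set."""
    index = set()
    for phrase in dataset_phrases:
        index |= windows(phrase, size)
    return index


def is_plagiarized(candidate, index, size):
    """True if the candidate shares at least one word window with the index."""
    return not windows(candidate, size).isdisjoint(index)


if __name__ == "__main__":
    dataset = ["Fall seven times, stand up eight."]
    index = build_index(dataset, size=3)
    print(is_plagiarized("When you fall seven times, get up", index, size=3))
    print(is_plagiarized("Stand tall and dream big", index, size=3))
```

A smaller window catches more partial copies but also flags more common turns of phrase, which is the trade-off behind using shorter windows for shorter texts. In this sketch, a candidate with fewer words than the window size is never flagged.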
Future Plans
Writing this section is always a project's curse, but I plan to generate social-shareable graphics with these phrases, to get closer to the style of those that moms share on Pinterest and Instagram, and give them canonical URLs here, so they can be deep-linked back to once they've been randomly presented. I'll rewrite the way I flatten the output data, so that I can forward all the generation params and some filtering tags to the UI in future, letting users fine-tune the types of motivation they receive. This might also be a good project for experimenting with a privacy-first "no-knowledge" method I've thought up for a suggestion engine, with some ML stuff where users could help improve the model for themselves and others through the way they interact with the phrases they like most.