Here I want to share how I slightly improved the tags on each of my blog posts.
The idea is super simple: a script should do the following:
- read each blog post's content
- parse the front matter containing the tags (the header part at the beginning of markdown files)
- create the prompt containing an excerpt of the blog post
- call OpenAI’s API and parse the suggested tags
- update the front matter with the suggested tags
- save the file back
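For context, the front matter is a small YAML block at the very top of each markdown file; a hypothetical post might start like this:

```markdown
---
title: A hypothetical post
tags:
  - post
  - javascript
---

Post content goes here...
```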
Find the code and more details below.
The prompt
This is the prompt I crafted that returns acceptable tags:
```
Given a blog post, suggest appropriate tags for it.
The blog is about software programming.
Extract the 5 most important tags from the blog post.
The tags need to satisfy the following:
- short
- simple
- super generic
- one word (avoid putting words together)
- lowercase, no camelcase
- no duplicate concepts in the tags, e.g. githubcodespaces and github should become github and codespaces
- no numbers as tags
- when a tag gets too long, split it in two or generalize it
- avoid very specific tags
- do not use generic tags like "blog", "post", "programming", "software" etc. (if they're useful to categorize the post it's ok)
- the output should not contain any other extra text
Return just 5 tags in comma separated list form, avoid adding any other text before or after the tags.
Blog Post:
${blogPost.substring(0,2000)}...
```
It’s surely not perfect, but it works for now.
Processing the files
A handy bit of `find` and `xargs` magic does the trick just fine.
E.g. finding all blog posts from 2023:

```shell
find posts -name "*.md" | grep "2023-" | xargs -I {} ./scripts/improve-tags-gpt.mjs {}
```
The script
I am using gray-matter to easily parse the posts' front matter, and openai's npm package for ease of use.
You’ll just need to set the env variable OPENAI_API_KEY and you’re good to go.
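For example, in a POSIX shell (the key value below is a placeholder, not a real one):

```shell
# Placeholder value; substitute your real API key from the OpenAI dashboard.
export OPENAI_API_KEY="sk-your-key-here"

# Then the script can be run on a post, e.g. (hypothetical path):
# ./scripts/improve-tags-gpt.mjs posts/2023-01-01-example.md
```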
The main part looks like this:
```javascript
#!/usr/bin/env node
import fs from 'fs/promises'
import OpenAI from 'openai'
import matter from 'gray-matter'

const ai = new OpenAI()
const filepath = process.argv[2]

processMarkdownFile(filepath)
```
Processing the markdown
As simple as reading the file, parsing the front matter, skipping drafts, and finally finding tags for the given blog post:
```javascript
async function processMarkdownFile(filePath) {
  try {
    const fileContent = await fs.readFile(filePath, 'utf-8')
    const { content, data } = matter(fileContent)
    if (data.tags?.includes('draft')) return

    const suggestedTags = await findTagsForBlogPost(content)
    data.tags = ['post'].concat(suggestedTags)

    const updatedContent = matter.stringify(content, data)
    await fs.writeFile(filePath, updatedContent, 'utf-8')
    console.log(`Updated tags for ${filePath}`)
  } catch (error) {
    console.error(`Error processing ${filePath}:`, error)
  }
}
```
Prompting the model
You'll just need to call `ai.chat.completions.create`, specifying the messages and the model to use. Then, given the response, you can clean up and parse the returned tags.
```javascript
async function findTagsForBlogPost(blogPost) {
  const prompt = `
Given a blog post, suggest appropriate tags for it.
The blog is about software programming.
Extract the 5 most important tags from the blog post.
The tags need to satisfy the following:
- short
- simple
- super generic
- one word (avoid putting words together)
- lowercase, no camelcase
- no duplicate concepts in the tags, e.g. githubcodespaces and github should become github and codespaces
- no numbers as tags
- when a tag gets too long, split it in two or generalize it
- avoid very specific tags
- do not use generic tags like "blog", "post", "programming", "software" etc. (if they're useful to categorize the post it's ok)
- the output should not contain any other extra text
Return just 5 tags in comma separated list form, avoid adding any other text before or after the tags.
Blog Post:
${blogPost.substring(0, 2000)}...
`
  const completion = await ai.chat.completions.create({
    messages: [{ role: 'user', content: prompt }],
    model: 'gpt-3.5-turbo',
  })

  const suggestedTags = completion.choices[0].message.content
    .trim()
    .replace(/Tags\:/gi, '')
  const tags = suggestedTags
    .split(',')
    .reduce((acc, tag) => acc.concat(tag.trim().split(' ')), [])
    .filter((_, i) => i < 10)
  console.log(tags)
  return tags
}
```
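To illustrate the cleanup, here is the parsing step in isolation, run on a hypothetical model reply (the model sometimes prefixes the list with "Tags:", which the replace strips):

```javascript
// Hypothetical model reply; real replies vary.
const reply = 'Tags: openai, scripting, markdown, tags, automation'

// Same cleanup as in the script: strip an optional "Tags:" prefix...
const cleaned = reply.trim().replace(/Tags\:/gi, '')

// ...then split on commas, break multi-word tags into single words,
// and keep at most 10 entries.
const tags = cleaned
  .split(',')
  .reduce((acc, tag) => acc.concat(tag.trim().split(' ')), [])
  .filter((_, i) => i < 10)

console.log(tags) // → [ 'openai', 'scripting', 'markdown', 'tags', 'automation' ]
```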
Dogfooding
I obviously ran the script on this very blog post and the suggested tags were the following:
- post
- software
- programming
- blog
- script
- openai
Works for me.
Conclusion
A more sensible approach would be to also feed the most used tags from the other posts into the prompt, so the model tends to reuse an existing vocabulary.
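A sketch of collecting the most used tags (with made-up data; in practice you'd gather the tag lists by running gray-matter over each post's front matter):

```javascript
// Made-up tag lists, one array per post.
const postTags = [
  ['openai', 'scripting'],
  ['openai', 'markdown'],
  ['scripting', 'shell'],
]

// Count how often each tag appears across posts.
const counts = postTags.flat().reduce((acc, tag) => {
  acc[tag] = (acc[tag] || 0) + 1
  return acc
}, {})

// Pick the most frequent tags to mention in the prompt.
const topTags = Object.entries(counts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 2)
  .map(([tag]) => tag)

console.log(topTags) // → [ 'openai', 'scripting' ]
```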
I am not 100% fond of the categorization the model returns, but it’s a little better than what I did manually over the years.
It would probably work better with a classic NLP solution, but for fun and convenience this is done with the model API.
PS: It cost about $0.50 to process ~300 blog posts (a few times).