
Improving my tags using OpenAI's Chat completions API

Here I want to share how I slightly improved the tags on each of my blog posts.

The idea is super simple. A script should do the following:

  • read each blog post's content
  • parse the front matter containing the tags (the header part at the beginning of markdown files)
  • create the prompt containing an excerpt of the blog post
  • call OpenAI’s API and parse the suggested tags
  • update the front matter with the new tags
  • save the file back

Find the code and more details below.

The prompt

This is the prompt I crafted that returns acceptable tags:

Given a blog post, suggest appropriate tags for it.
The blog is about software programming.
Extract the 5 most important tags from the blog post.
The tags need to satisfy the following:
- short
- simple
- super generic
- one word (avoid putting words together)
- lowercase, no camelcase
- no duplicate concepts in the tags, e.g. githubcodespaces and github should become github and codespaces
- no numbers as tags
- when a tag gets too long, split it in two or generalize it
- avoid very specific tags
- do not use generic tags like "blog", "post", "programming", "software" etc. (if they're useful to categorize the post it's ok)
- the output should not contain any other extra text

Return just 5 tags in comma separated list form, avoid adding any other text before or after the tags.

Blog Post:
${blogPost.substring(0,2000)}...

It’s surely not perfect, but it works for now.

Processing the files

A bit of find and xargs magic does the trick just fine.

E.g. to process all blog posts from 2023:

find posts -name "*.md" | grep "2023-" | xargs -I {} ./scripts/improve-tags-gpt.mjs {}
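If you would rather select the files in Node than in the shell, the grep "2023-" step could be sketched as a small filter. This assumes post filenames start with the publication date (e.g. 2023-01-15-new-post.md), which is only a guess based on the grep pattern above. filterPostsByYear is a hypothetical helper and the sample paths are made up.

```javascript
// Hypothetical helper: keep only markdown files whose filename starts with "<year>-".
// Assumes filenames begin with the publication date, e.g. "2023-01-15-new-post.md".
function filterPostsByYear(paths, year) {
  return paths.filter((p) => {
    const filename = p.split('/').pop()
    return filename.endsWith('.md') && filename.startsWith(`${year}-`)
  })
}

// Made-up paths, just to show the behaviour
const sample = [
  'posts/2022-11-02-old-post.md',
  'posts/2023-01-15-new-post.md',
  'posts/2023-06-30-notes.txt',
]
console.log(filterPostsByYear(sample, 2023))
// prints [ 'posts/2023-01-15-new-post.md' ]
```

Unlike grep "2023-", which matches anywhere in the path, this only looks at the start of the filename.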

The script

I am using gray-matter to easily parse each post's front matter.

I also use openai's npm package for ease of use.

You’ll just need to set the env variable OPENAI_API_KEY and you’re good to go.

The main part looks like this:

#!/usr/bin/env node

import fs from 'fs/promises'
import OpenAI from 'openai'
import matter from 'gray-matter'

const ai = new OpenAI()
const filepath = process.argv[2]
processMarkdownFile(filepath)

Processing the markdown

It's as simple as reading the file, parsing the front matter, skipping drafts and finally finding tags for the given blog post:

async function processMarkdownFile(filePath) {
  try {
    const fileContent = await fs.readFile(filePath, 'utf-8')

    const { content, data } = matter(fileContent)
    if (data.tags?.includes('draft')) return

    const suggestedTags = await findTagsForBlogPost(content)

    data.tags = ['post'].concat(suggestedTags)

    const updatedContent = matter.stringify(content, data)

    await fs.writeFile(filePath, updatedContent, 'utf-8')

    console.log(`Updated tags for ${filePath}`)
  } catch (error) {
    console.error(`Error processing ${filePath}:`, error)
  }
}
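One detail worth noting: ['post'].concat(suggestedTags) keeps duplicates, so if the model returns post itself, or repeats a tag, it ends up in the front matter twice. Here is a minimal sketch of a de-duplicating merge. mergeTags is a hypothetical helper, not something the script above defines.

```javascript
// Hypothetical helper: prepend the fixed tags and drop duplicates,
// keeping the first occurrence of each tag.
function mergeTags(fixedTags, suggestedTags) {
  // A Set preserves insertion order, so 'post' stays first
  return [...new Set([...fixedTags, ...suggestedTags])]
}

console.log(mergeTags(['post'], ['openai', 'post', 'script', 'openai']))
// prints [ 'post', 'openai', 'script' ]
```

It could replace the concat line, e.g. data.tags = mergeTags(['post'], suggestedTags).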

Prompting the model

You’ll just need to call ai.chat.completions.create and specify the messages and the model to use.

Then, given the response, you can clean up and parse the returned tags.

async function findTagsForBlogPost(blogPost) {
  const prompt = `
    Given a blog post, suggest appropriate tags for it.
    The blog is about software programming.
    Extract the 5 most important tags from the blog post.
    The tags need to satisfy the following:
    - short
    - simple
    - super generic
    - one word (avoid putting words together)
    - lowercase, no camelcase
    - no duplicate concepts in the tags, e.g. githubcodespaces and github should become github and codespaces
    - no numbers as tags
    - when a tag gets too long, split it in two or generalize it
    - avoid very specific tags
    - do not use generic tags like "blog", "post", "programming", "software" etc. (if they're useful to categorize the post it's ok)
    - the output should not contain any other extra text
    
    Return just 5 tags in comma separated list form, avoid adding any other text before or after the tags.
    
    Blog Post:
    ${blogPost.substring(0,2000)}...
    `

  const completion = await ai.chat.completions.create({
    messages: [{ role: 'user', content: prompt }],
    model: 'gpt-3.5-turbo',
  })

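  // Strip a "Tags:" label in case the model adds one despite the prompt
  const suggestedTags = (completion.choices[0].message.content ?? '').replace(/tags:/gi, '').trim()
  // Split on commas, then on whitespace so multi-word tags become separate tags;
  // drop empty strings and cap the list at 10 entries
  const tags = suggestedTags.split(',').flatMap(tag => tag.trim().split(/\s+/)).filter(Boolean).slice(0, 10)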
  console.log(tags)
  return tags
}
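The clean-up step is easy to try out in isolation, without calling the API. The sketch below mirrors the parsing logic of findTagsForBlogPost as a pure function, with empty strings filtered out. parseTags is a hypothetical name and the sample model output is made up.

```javascript
// Hypothetical helper mirroring the clean-up in findTagsForBlogPost,
// extracted so it can be tried without an API call.
function parseTags(raw) {
  return raw
    .replace(/tags:/gi, '') // drop a "Tags:" label if the model adds one
    .split(',') // the prompt asks for a comma-separated list
    .flatMap((tag) => tag.trim().split(/\s+/)) // break multi-word tags into single words
    .filter(Boolean) // remove empty strings
    .slice(0, 10) // cap the list at 10 entries
}

// Made-up model output with a label, a two-word tag and a trailing comma
console.log(parseTags('Tags: openai, script, github codespaces, '))
// prints [ 'openai', 'script', 'github', 'codespaces' ]
```

Note that the cap is 10 rather than 5, since splitting multi-word tags can produce more entries than the model was asked for.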

Dogfooding

Naturally, I ran the script on this very blog post, and the resulting tags were the following:

- post
- software
- programming
- blog
- script
- openai

Works for me.

Conclusion

A more sensible approach would be to feed the model (e.g. in the prompt) the most-used tags from the other posts, so that it can reuse existing tags instead of inventing new ones.
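A rough sketch of what that could look like: count how often each tag appears across the posts, then add the most frequent ones to the prompt as a hint. This is not part of the script. mostUsedTags, the sample tag lists and the wording of the hint are all assumptions made for illustration.

```javascript
// Hypothetical helper: count how often each tag appears across posts
// and return the `limit` most frequent ones.
function mostUsedTags(tagLists, limit = 20) {
  const counts = new Map()
  for (const tags of tagLists) {
    for (const tag of tags) {
      counts.set(tag, (counts.get(tag) ?? 0) + 1)
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, limit)
    .map(([tag]) => tag)
}

// Made-up tag lists standing in for the data.tags of each parsed post
const existing = [
  ['post', 'javascript', 'node'],
  ['post', 'javascript', 'css'],
  ['post', 'node', 'javascript'],
]
// 'post' is on every post, so it is not a useful suggestion
const popular = mostUsedTags(existing, 3).filter((tag) => tag !== 'post')
const hint = `Prefer these existing tags when they fit: ${popular.join(', ')}`
console.log(hint)
// prints "Prefer these existing tags when they fit: javascript, node"
```

The hint line could then be appended to the list of requirements in the prompt.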

I am not 100% fond of the categorization the model returns, but it’s a little better than what I did manually over the years.

A classic NLP solution would probably work better, but for fun and convenience I did this with the model API.

PS: It cost about $0.50 to process ~300 blog posts (a few times).
