Thursday, April 11, 2024

Me and ChromaDB - A Series! Let's Create Our First Vector Database with Cosine Similarity

Hi Friends,

I got PIP to install!

I've been doing a bunch of research and wanted to give credit to some great blogs!!


Hasini Madushika's blog:

https://medium.com/@hasinivijayarathna/creating-a-vector-database-using-chroma-956b1d84aca3

Fantastic overview on how to set up your first ChromaDB and create a searchable index of books and authors.


Michael Wornow's blog:

https://michaelwornow.net/2023/12/31/chromadb-demo

Great ChromaDB overview with pros and cons as well as a great section on cosine similarity vs. distance.


Milana Shkhanukova's blog:

https://medium.com/@milana.shxanukova15/cosine-distance-and-cosine-similarity-a5da0e4d9ded

Fantastic job explaining in more detail what cosine distance is and how it differs from similarity.


Harrison Hoffman's blog:

https://realpython.com/chromadb-vector-database/

Another great blog on ChromaDB foundations as well as lots of information on vector similarity.


Who knew physics would actually be useful?!  Yeah yeah, Isaac Newton did in 1687.

Now let's get rolling!

1. Let's make sure Python3 is working:

#python3 -V

Python 3.10.12

WOOT!  And just to let you know, I'm using Ubuntu 22.04.2.

2. And to install the ever-elusive PIP.

#sudo apt install python3-pip -y

#pip -V

pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

ALRIGHT!

Now let's create a vector database!


1.  The first thing I'm going to do is create a place to store my data.

mkdir /data

I did all this in a bash shell, but you can create a Python script and run it in there too if you'd like.

2. Run Python so you can run the code.

#python3

3. Import ChromaDB into Python.

import chromadb
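One step not shown above: the chromadb package itself has to be installed (normally with pip install chromadb) before that import will work. Here's a minimal sketch of a check you could run first. The helper name is_installed is my own invention, not part of ChromaDB.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("chromadb"):
        print("chromadb is available")
    else:
        print("chromadb is missing; try: pip install chromadb")
```

This only confirms Python can find the package. It doesn't tell you whether the installed version matches the examples in this post.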

Now here comes the fun part.  Do you want your database to live only in memory, or do you want it saved somewhere?  It kinda depends on your needs, the space you have, etc.  But I don't want all my work to go poof, so I'm going to save it to disk.

4. Use this if you just want memory resident.

chroma_client = chromadb.Client()


OR


4. Use this if you want to save your database somewhere.

persistentClient = chromadb.PersistentClient(path="/data")

I'm going to save my data to a directory called "/data", but you can save yours anywhere you'd like.

What's really cool about this is that ChromaDB saves this data into a SQLite3 database for you.  If you're having trouble with it, take a look at the troubleshooting section on the Chroma website (https://docs.trychroma.com/troubleshooting).
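If you're curious what actually landed on disk, here's a small sketch that lists the tables in a SQLite file using only Python's built-in sqlite3 module. I'm assuming the file is named chroma.sqlite3 inside the directory you passed as path. Check your own directory, because the file name and table layout are Chroma internals and may differ between versions. The helper name list_tables is mine.

```python
import os
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the table names in a SQLite file, sorted alphabetically."""
    # Open read-only so we can't accidentally modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumed location: adjust to match your PersistentClient path.
    db_file = os.path.join("/data", "chroma.sqlite3")
    if os.path.exists(db_file):
        print(list_tables(db_file))
    else:
        print("No database file found at", db_file)
```

Treat this as a peek under the hood only. Reading and writing your data should still go through the ChromaDB client.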

5. This is the really cool part!  Next we're creating our collection.  Here we're going to give it a name, and you'll notice a geometry and physics term you probably hoped would never haunt you again, COSINE!  What's going on here is we're telling the database to compare entries by the angle between their vectors (the embeddings), which is called cosine distance.  If two vectors point in nearly the same direction, the distance is small and the text probably means similar things.  The more the directions differ, the larger the distance and the less related the text is likely to be.  In Michael Wornow's blog he shows you how to get the similarity instead of the distance if you'd rather do that.
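To make the cosine idea concrete, here's a minimal sketch in plain Python. It assumes cosine distance is defined as 1 minus cosine similarity, which is how the blogs linked above describe it. The function names and the tiny 2-D vectors are made up for illustration, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - similarity: near 0 = same direction, 1 = perpendicular, 2 = opposite."""
    return 1.0 - cosine_similarity(a, b)


# Same direction, but the second vector is three times longer.
same_direction = cosine_distance([1.0, 2.0], [3.0, 6.0])
# At right angles to each other.
perpendicular = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Pointing in exactly opposite directions.
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Notice the first pair: the vectors have different lengths, but the distance still comes out at essentially zero because only the direction counts.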

books_collection = persistentClient.create_collection( name="books", 

       metadata={"hnsw:space": "cosine"}

)

6.  Next thing we do is add data to our books collection.  I'm going to use Hasini Madushika's book collection because her book titles do a great job showing how the cosine feature works.  There's a lot going on here, but I think I'll break this down further in another blog about embeddings and such.


books_collection.add(

    documents=[

  "The Enigma Code", "Decoding Secrets", "Whispers of Intrigue", "Conundrums and Clues",

  "The Puzzle Master", "Mysterious Ciphers", "The Labyrinth of Enigmas", "Cryptic Chronicles",

  "Puzzled Minds", "Secrets Unveiled", "Echoes of Eternity", "Time's Embrace",

  "Chronicles of Eternity", "Eternal Moments", "Timeless Whispers", "Infinity's Tapestry",

  "Temporal Odyssey", "Endless Hours", "The Time Weaver", "Eternal Sands",

  "The Silent Symphony", "Whispers of Silence", "Silent Serenade", "The Sound of Quiet",

  "Quiet Harmony", "Harmony in Silence", "Muted Melodies", "Serenity's Echo",

  "The Tranquil Note", "Echoes of Quietude"

  ],

    metadatas=[{"author":"Alan Cipher", "price":"$19.99"},{"author":"Olivia Mystery", "price":"$18.95"},

{"author":"James Riddle", "price":"$21.50"},{"author":"Emma Puzzler", "price":"$22.99"},

{"author":"Alex Brainteaser", "price":"$20.75"},{"author":"Victoria Enigma", "price":"$23.45"},

{"author":"Samuel Conundrum", "price":"$24.80"},{"author":"Grace Enigma", "price":"$19.25"},

{"author":"Daniel Riddle", "price":"$17.99"},{"author":"Amanda Mystery", "price":"$21.00"},

{"author":"Robert Timeless", "price":"$25.50"},{"author":"Sarah Infinity", "price":"$26.75"},

{"author":"Michael Eternal", "price":"$24.99"},{"author":"Emily Timekeeper", "price":"$23.20"},

{"author":"Christopher Infinity", "price":"$22.45"},{"author":"Jessica Forever", "price":"$27.30"},

{"author":"Nicholas Timeless", "price":"$28.50"},{"author":"Laura Infinity", "price":"$26.00"},

{"author":"Benjamin Chronos", "price":"$24.95"},{"author":"Rachel Timebound", "price":"$25.75"},

{"author":"William Hush", "price":"$18.50"},{"author":"Sophia Mute", "price":"$17.75"},

{"author":"Oliver Quietude", "price":"$19.20"},{"author":"Isabella Hushington", "price":"$20.15"},

{"author":"Matthew Serene", "price":"$18.99"},{"author":"Emily Tranquil", "price":"$21.50"},

{"author":"Christopher Hushwell", "price":"$22.75"},{"author":"Grace Silentheart", "price":"$19.95"},

{"author":"Daniel Peaceful", "price":"$23.00"},{"author":"Victoria Hushed", "price":"$20.80"}

  ],

    ids=["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9", "id10", "id11", "id12", "id13", "id14", "id15", 

"id16", "id17", "id18", "id19", "id20", "id21", "id22", "id23", "id24", "id25", "id26", "id27", "id28", "id29", "id30"

  ]

)
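The three lists passed to add() line up by position: the first title goes with the first metadata entry and the first id. With 30 entries typed by hand it's easy to drop one, so here's a small sketch of a sanity check you could run before calling add(). The helper name check_parallel_lists is mine, not a ChromaDB function, and the sample is just the first three books from above.

```python
def check_parallel_lists(documents, metadatas, ids):
    """Raise ValueError if the lists can't be lined up one-to-one."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError(
            f"length mismatch: {len(documents)} documents, "
            f"{len(metadatas)} metadatas, {len(ids)} ids"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")


documents = ["The Enigma Code", "Decoding Secrets", "Whispers of Intrigue"]
metadatas = [
    {"author": "Alan Cipher", "price": "$19.99"},
    {"author": "Olivia Mystery", "price": "$18.95"},
    {"author": "James Riddle", "price": "$21.50"},
]
# Build "id1", "id2", ... instead of typing each id by hand.
ids = [f"id{n}" for n in range(1, len(documents) + 1)]

check_parallel_lists(documents, metadatas, ids)  # passes silently when all is well
print(ids)
```

Generating the ids from the length of the documents list means the two can't drift apart when you add more books later.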

7.  Now that we've got data in our database we can query it!

results = books_collection.query(

  query_texts=["Eternity", "Puzzle"],

  n_results=5

)

print(results)


{'ids': [['id11', 'id13', 'id14', 'id18', 'id20'], ['id5', 'id9', 'id1', 'id4', 'id19']], 'distances': [[0.33300162331174143, 0.378593976226286, 0.42385445272206546, 0.45082819934444696, 0.5292349798068949], [0.44286587397375554, 0.5532287339143328, 0.5986420831809907, 0.6688832557674483, 0.6887007582505835]], 'metadatas': [[{'author': 'Robert Timeless', 'price': '$25.50'}, {'author': 'Michael Eternal', 'price': '$24.99'}, {'author': 'Emily Timekeeper', 'price': '$23.20'}, {'author': 'Laura Infinity', 'price': '$26.00'}, {'author': 'Rachel Timebound', 'price': '$25.75'}], [{'author': 'Alex Brainteaser', 'price': '$20.75'}, {'author': 'Daniel Riddle', 'price': '$17.99'}, {'author': 'Alan Cipher', 'price': '$19.99'}, {'author': 'Emma Puzzler', 'price': '$22.99'}, {'author': 'Benjamin Chronos', 'price': '$24.95'}]], 'embeddings': None, 'documents': [['Echoes of Eternity', 'Chronicles of Eternity', 'Eternal Moments', 'Endless Hours', 'Eternal Sands'], ['The Puzzle Master', 'Puzzled Minds', 'The Enigma Code', 'Conundrums and Clues', 'The Time Weaver']], 'uris': None, 'data': None}
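That output is nested: each key holds one inner list per query text, in the same order as query_texts, and the items inside each inner list line up by position. Here's a sketch that flattens it into one row per hit so it's easier to read. The helper name flatten_results is my own, and the sample dict is just the top two hits per query copied from the output above.

```python
def flatten_results(query_texts, results):
    """Turn nested query results into one flat row per hit."""
    rows = []
    for i, query in enumerate(query_texts):
        for doc_id, title, meta, dist in zip(
            results["ids"][i],
            results["documents"][i],
            results["metadatas"][i],
            results["distances"][i],
        ):
            rows.append((query, doc_id, title, meta["author"], round(dist, 3)))
    return rows


# Top two hits per query, copied from the printed results and trimmed.
sample = {
    "ids": [["id11", "id13"], ["id5", "id9"]],
    "documents": [
        ["Echoes of Eternity", "Chronicles of Eternity"],
        ["The Puzzle Master", "Puzzled Minds"],
    ],
    "metadatas": [
        [{"author": "Robert Timeless", "price": "$25.50"},
         {"author": "Michael Eternal", "price": "$24.99"}],
        [{"author": "Alex Brainteaser", "price": "$20.75"},
         {"author": "Daniel Riddle", "price": "$17.99"}],
    ],
    "distances": [
        [0.33300162331174143, 0.378593976226286],
        [0.44286587397375554, 0.5532287339143328],
    ],
}

for row in flatten_results(["Eternity", "Puzzle"], sample):
    print(row)
```

Within each query the hits come back with the smallest distance first, so the closest match is at the top.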


8.  Let's pull just the book titles out of those results, one list per query.

print("results for 'Eternity':", results["documents"][0])

print("results for 'Puzzle':", results["documents"][1])


results for 'Eternity': ['Echoes of Eternity', 'Chronicles of Eternity', 'Eternal Moments', 'Endless Hours', 'Eternal Sands']

results for 'Puzzle': ['The Puzzle Master', 'Puzzled Minds', 'The Enigma Code', 'Conundrums and Clues', 'The Time Weaver']

This is really cool.  Notice the key words are Eternity and Puzzle.  The query finds titles with the exact word, but also titles with a similar meaning.  The vectors may not have the same magnitude, since cosine only compares their direction, but isn't that cool?!?!

Until Next Time!

Neil
