You are here

Playing with MariaDB Vector for initial AI tests

Artificial intelligence (AI) and vector databases are on everyone's lips these days. Since MariaDB will soon be coming onto the market with vector database functionality, as a database consultant I thought it was about time I got to grips with the subject so that I have at least a hint of what it's all about...

Since I'm not so much of a theoretician but rather like to do something practical, I've built a small "AI" prototype that anyone can build very quickly and easily on their laptop (without a GPU)...

I also took the liberty of stealing the graphs from the MariaDB Foundation presentation (see sources at the end).

Downloading the MariaDB database with vector functionality

There are no MariaDB packages with vector functionality yet, but the source code is already available. So you can quickly build the binaries yourself. This took just under an hour on my old box. Once the binaries are built, you can make a tarball out of them:

# tar xf mariadb-11.6.0_vector.tar.gz
# cd mariadb-11.6.0_vector/
# cmake .
# make
# make package

The MariaDB database then only needs to be started.

The model

To show the concept of tokenisation I decided to build an AI for URLs and to show the concept of different models and their potential for improvement I built a very stupid model in PHP, which simply decomposes a URL.

The question that this model should be able to answer is: "Give me similar URLs to the following URL."

The corresponding table looks like this:

DROP TABLE IF EXISTS `urls`;
-- TRUNCATE TABLE is NOT sufficient!!!
CREATE TABLE `urls` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, `url` varchar(1024) DEFAULT NULL
, `title` varchar(2000) DEFAULT NULL
, `embedding` blob NOT NULL
, VECTOR KEY `embedding` (`embedding`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
;

The model fromdual_llm_v1 can be downloaded here.

This diagram from the MariaDB Foundation shows roughly how the whole thing works:

Training the AI

The database is then trained: The URL is taken as given and the title can be read out using an HTML scraper, for example. Here are 8 training datasets:


The vectors are then generated using our model:

(./fromdual_llm_v1.php https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.6.0+Vector&os=source
./fromdual_llm_v1.php https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/
./fromdual_llm_v1.php https://mariadb.org/wp-content/uploads/2024/02/MariaDB-Vector.pdf
./fromdual_llm_v1.php https://mariadb.com/resources/blog/mariadb-vector-preview-is-out/
./fromdual_llm_v1.php https://mariadb.org/projects/mariadb-vector/
./fromdual_llm_v1.php https://metacpan.org/pod/Perl::Tokenizer
./fromdual_llm_v1.php https://www.qwak.com/post/utilizing-llms-with-embedding-stores
./fromdual_llm_v1.php https://github.com/qwak-ai/qwak-examples/tree/main/qa_bot_falcon_chroma) | grep '^\['
[0.2, 0.0107421875, 0, 0, 0, 0.0006103515625, 0.00054931640625, 0]
[0.2, 0.0107421875, 0, 0, 0, 0.00262451171875, 0, 0]
[0.2, 0.0107421875, 0, 0, 0, 0.0028076171875, 0, 0]
[0.2, 0.0107421875, 0, 0, 0, 0.0028076171875, 0, 0]
[0.2, 0.0107421875, 0, 0, 0, 0.00152587890625, 0, 0]
[0.2, 0.01171875, 0, 0, 0, 0.001220703125, 0, 0]
[0.2, 0.01171875, 0, 0, 0, 0.0025634765625, 0, 0]
[0.2, 0.009765625, 0, 0, 0, 0.00323486328125, 0, 0]

The database is now fed (trained) with these vectors:

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.6.0+Vector&os=source'
, 'Download MariaDB Server'
, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.00262451171875, 0, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/'
, 'Creating the MariaDB Binary Tarball'
, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.0006103515625, 0.00054931640625, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://mariadb.org/wp-content/uploads/2024/02/MariaDB-Vector.pdf'
, 'MariaDB Vector'
, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.0028076171875, 0, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://mariadb.com/resources/blog/mariadb-vector-preview-is-out/'
, 'MariaDB Vector preview is out'
, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.0028076171875, 0, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://mariadb.org/projects/mariadb-vector/'
, 'MariaDB Vector'
, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.00152587890625, 0, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://metacpan.org/pod/Perl::Tokenizer'
, 'Perl::Tokenizer - A tiny Perl code tokenizer'
, VEC_FromText('[0.2, 0.01171875, 0, 0, 0, 0.001220703125, 0, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://www.qwak.com/post/utilizing-llms-with-embedding-stores'
, 'Integrating Vector Databases with LLMs: A Hands-On Guide'
, VEC_FromText('[0.2, 0.01171875, 0, 0, 0, 0.0025634765625, 0, 0]')
);

INSERT INTO `urls` (id, url, title, embedding) VALUES (
  NULL
, 'https://github.com/qwak-ai/qwak-examples/tree/main/qa_bot_falcon_chroma'
, 'LLM Model Enhanced with Vector DB'
, VEC_FromText('[0.2, 0.009765625, 0, 0, 0, 0.00323486328125, 0, 0]')
);

Here is an overview of what is now in the database:

SELECT id, url, title, VEC_ToText(embedding)
  FROM urls
;
+----+-----------------------------------------------------------------------------+----------------------------------------------------------+---------------------------------------------------------------------------+
| id | url                                                                         | title                                                    | VEC_ToText(embedding)                                                     |
+----+-----------------------------------------------------------------------------+----------------------------------------------------------+---------------------------------------------------------------------------+
|  1 | https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.6.0+Vector&os=source | Download MariaDB Server                                  | [0.200000,0.010742,0.000000,0.000000,0.000000,0.002625,0.000000,0.000000] |
|  2 | https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/              | Creating the MariaDB Binary Tarball                      | [0.200000,0.010742,0.000000,0.000000,0.000000,0.000610,0.000549,0.000000] |
|  3 | https://mariadb.org/wp-content/uploads/2024/02/MariaDB-Vector.pdf           | MariaDB Vector                                           | [0.200000,0.010742,0.000000,0.000000,0.000000,0.002808,0.000000,0.000000] |
|  4 | https://mariadb.com/resources/blog/mariadb-vector-preview-is-out/           | MariaDB Vector preview is out                            | [0.200000,0.010742,0.000000,0.000000,0.000000,0.002808,0.000000,0.000000] |
|  5 | https://mariadb.org/projects/mariadb-vector/                                | MariaDB Vector                                           | [0.200000,0.010742,0.000000,0.000000,0.000000,0.001526,0.000000,0.000000] |
|  6 | https://metacpan.org/pod/Perl::Tokenizer                                    | Perl::Tokenizer - A tiny Perl code tokenizer             | [0.200000,0.011719,0.000000,0.000000,0.000000,0.001221,0.000000,0.000000] |
|  7 | https://www.qwak.com/post/utilizing-llms-with-embedding-stores              | Integrating Vector Databases with LLMs: A Hands-On Guide | [0.200000,0.011719,0.000000,0.000000,0.000000,0.002563,0.000000,0.000000] |
|  8 | https://github.com/qwak-ai/qwak-examples/tree/main/qa_bot_falcon_chroma     | LLM Model Enhanced with Vector DB                        | [0.200000,0.009766,0.000000,0.000000,0.000000,0.003235,0.000000,0.000000] |
+----+-----------------------------------------------------------------------------+----------------------------------------------------------+---------------------------------------------------------------------------+

Search in the MariaDB vector database

Now comes the exciting part of the whole story: Can we also find something in our MariaDB vector database with URLs?

How this works schematically can again be seen in the MariaDB Foundation diagram:

The first attempt is a perfect match:

./fromdual_llm_v1.php https://mariadb.org/projects/mariadb-vector/
[0.2, 0.0107421875, 0, 0, 0, 0.00152587890625, 0, 0]

SELECT id, url, title, VEC_ToText(embedding)
  FROM urls
 ORDER BY VEC_DISTANCE(embedding, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.00152587890625, 0, 0]'))
LIMIT 3
;
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+
| id | url                                                            | title                                         | VEC_ToText(embedding)                                                     |
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+
|  5 | https://mariadb.org/projects/mariadb-vector/                   | MariaDB Vector                                | [0.200000,0.010742,0.000000,0.000000,0.000000,0.001526,0.000000,0.000000] |
|  6 | https://metacpan.org/pod/Perl::Tokenizer                       | Perl::Tokenizer - A tiny Perl code tokenizer. | [0.200000,0.011719,0.000000,0.000000,0.000000,0.001221,0.000000,0.000000] |
|  2 | https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/ | Creating the MariaDB Binary Tarball           | [0.200000,0.010742,0.000000,0.000000,0.000000,0.000610,0.000549,0.000000] |
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+

The first row matches 100%. Then the results get much worse relatively quickly...

Second attempt a similar URL:

./fromdual_llm_v1.php https://mariadb.com/kb/en/e4201/
[0.2, 0.0107421875, 0, 0, 0, 0.00079345703125, 0, 0]

SELECT id, url, title, VEC_ToText(embedding)
  FROM urls
 ORDER BY VEC_DISTANCE(embedding, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.00079345703125, 0, 0]'))
LIMIT 3
;
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+
| id | url                                                            | title                                         | VEC_ToText(embedding)                                                     |
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+
|  2 | https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/ | Creating the MariaDB Binary Tarball           | [0.200000,0.010742,0.000000,0.000000,0.000000,0.000610,0.000549,0.000000] |
|  5 | https://mariadb.org/projects/mariadb-vector/                   | MariaDB Vector                                | [0.200000,0.010742,0.000000,0.000000,0.000000,0.001526,0.000000,0.000000] |
|  6 | https://metacpan.org/pod/Perl::Tokenizer                       | Perl::Tokenizer - A tiny Perl code tokenizer. | [0.200000,0.011719,0.000000,0.000000,0.000000,0.001221,0.000000,0.000000] |
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+

Here I would expect only mariadb URLs among the first 3 hits. But this is not the case. So our model still has room for improvement here!

And another similar URL:

./fromdual_llm_v1.php https://mariadb.com/kb/en/vec_totext/
[0.2, 0.0107421875, 0, 0, 0, 0.0010986328125, 0, 0]

SELECT id, url, title, VEC_ToText(embedding)
  FROM urls
 ORDER BY VEC_DISTANCE(embedding, VEC_FromText('[0.2, 0.0107421875, 0, 0, 0, 0.0010986328125, 0, 0]'))
LIMIT 3
;
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+
| id | url                                                            | title                                         | VEC_ToText(embedding)                                                     |
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+
|  5 | https://mariadb.org/projects/mariadb-vector/                   | MariaDB Vector                                | [0.200000,0.010742,0.000000,0.000000,0.000000,0.001526,0.000000,0.000000] |
|  2 | https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/ | Creating the MariaDB Binary Tarball           | [0.200000,0.010742,0.000000,0.000000,0.000000,0.000610,0.000549,0.000000] |
|  6 | https://metacpan.org/pod/Perl::Tokenizer                       | Perl::Tokenizer - A tiny Perl code tokenizer. | [0.200000,0.011719,0.000000,0.000000,0.000000,0.001221,0.000000,0.000000] |
+----+----------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------+

Same problem here. The hostname is not weighted enough. You can/must probably play with the scatter that mariadb.org and mariadb.com generate.

And last but not least, a URL that does not appear in the data set at all:

./fromdual_llm_v1.php https://www.mongodb.com/blog/post/vector-search-llm-essentials-what-when-why
[0.2, 0.0146484375, 0, 0, 0, 0.00323486328125, 0, 0]
SELECT id, url, title, VEC_ToText(embedding)
  FROM urls
 ORDER BY VEC_DISTANCE(embedding, VEC_FromText('[0.2, 0.0146484375, 0, 0, 0, 0.00323486328125, 0, 0]'))
LIMIT 5
;
+----+-----------------------------------------------------------------------------+----------------------------------------------------------+---------------------------------------------------------------------------+
| id | url                                                                         | title                                                    | VEC_ToText(embedding)                                                     |
+----+-----------------------------------------------------------------------------+----------------------------------------------------------+---------------------------------------------------------------------------+
|  7 | https://www.qwak.com/post/utilizing-llms-with-embedding-stores              | Integrating Vector Databases with LLMs: A Hands-On Guide | [0.200000,0.011719,0.000000,0.000000,0.000000,0.002563,0.000000,0.000000] |
|  6 | https://metacpan.org/pod/Perl::Tokenizer                                    | Perl::Tokenizer - A tiny Perl code tokenizer.            | [0.200000,0.011719,0.000000,0.000000,0.000000,0.001221,0.000000,0.000000] |
|  3 | https://mariadb.org/wp-content/uploads/2024/02/MariaDB-Vector.pdf           | MariaDB Vector                                           | [0.200000,0.010742,0.000000,0.000000,0.000000,0.002808,0.000000,0.000000] |
|  4 | https://mariadb.com/resources/blog/mariadb-vector-preview-is-out/           | MariaDB Vector preview is out                            | [0.200000,0.010742,0.000000,0.000000,0.000000,0.002808,0.000000,0.000000] |
|  1 | https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.6.0+Vector&os=source | Download MariaDB Server                                  | [0.200000,0.010742,0.000000,0.000000,0.000000,0.002625,0.000000,0.000000] |
+----+-----------------------------------------------------------------------------+----------------------------------------------------------+---------------------------------------------------------------------------+

Here the result seems to be completely arbitrary. But if you compare the vector of the query with the vectors of the results, the order makes sense... Are the dimensions in the vector evaluated from left to right? After all, the distance between two points in 8-dimensional space is to be determined...

Improvements in the model

The results of our AI are not yet particularly impressive. On the one hand, this is certainly due to the very limited amount of data, on the other hand, we have not yet modelled very important criteria in our model or have used completely nonsensical criteria.

Suggestions for improvement for a next model: The hostname could also be tokenised so that mariadb.com and mariadb.org are closer together.

The length of hostname, path, query and fragment is certainly not a particularly clever criterion for mapping the similarity of URLs. Much more intelligence would be needed in the model here. A function 1/CRC32(dim) might already provide slightly better results?

The title could be included, or at least the most important words (nouns, verbs) from the title.

The document type (MIME type) could be included: Is a PDF more similar to another PDF than to a CSV file or a static HMTL page or a dynamic PHP page?

Points that stood out while playing

The number of dimensions in a vector seem to be set on the first INSERT. If you subsequently enter data with a different vector length, the following error appears:

INSERT INTO products (name, description, embedding) VALUES (
  'Coffee Machine'
, 'Built to make the best coffee you can imagine'
, VEC_FromText('[0.2, 0.013671875, 0, 0, 0, 6.103515625E-5, 0, 0]')
);
ERROR 1366 (22007): Incorrect vector value: '...' for column `test`.`products`.`embedding` at row 1

Changing the vector length is currently NOT possible with a TRUNCATE TABLE command. The table must be dropped (DROP TABLE) and created again (CREATE TABLE).

However, searching with a shorter vector is possible:

SELECT id, url, title, VEC_ToText(embedding)
  FROM urls
 ORDER BY VEC_DISTANCE(embedding, VEC_FromText('[0.2, 0.0107421875]'))
LIMIT 3
;
+----+-------------------------------------------------------------------------+-------------------------------------+---------------------------------------------------------------------------+
| id | url                                                                     | title                               | VEC_ToText(embedding)                                                     |
+----+-------------------------------------------------------------------------+-------------------------------------+---------------------------------------------------------------------------+
|  8 | https://github.com/qwak-ai/qwak-examples/tree/main/qa_bot_falcon_chroma | LLM Model Enhanced with Vector DB   | [0.200000,0.009766,0.000000,0.000000,0.000000,0.003235,0.000000,0.000000] |
|  2 | https://mariadb.com/kb/en/creating-the-mariadb-binary-tarball/          | Creating the MariaDB Binary Tarball | [0.200000,0.010742,0.000000,0.000000,0.000000,0.000610,0.000549,0.000000] |
|  5 | https://mariadb.org/projects/mariadb-vector/                            | MariaDB Vector                      | [0.200000,0.010742,0.000000,0.000000,0.000000,0.001526,0.000000,0.000000] |
+----+-------------------------------------------------------------------------+-------------------------------------+---------------------------------------------------------------------------+

I cannot (yet) judge whether the result is so useful.

Sources


Taxonomy upgrade extras: 

Comments

We have improved our model (see v2) so the results are a bit better/more accurate now.

For usage see ./fromdual_llm_v2.php --help

Shinguzcomment