
Content provided by Oracle Corporation. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Oracle Corporation or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://de.player.fm/legal.

MyVector Magic: Elevating MySQL with AI Search

19:02
 

Oracle Ace Alkin Tezuysal joins leFred and Scott to introduce the MyVector plugin for MySQL Community Edition, bringing powerful vector search capabilities to your favorite open-source database. Learn how MyVector enables advanced AI and similarity search features, why this matters for modern applications, and how the MySQL community can easily get started.

-------------------------------------------------------------

Episode Transcript:

00:00.000 --> 00:25.000 Welcome to Inside MySQL: Sakila Speaks, a podcast dedicated to all things MySQL. We bring you the latest news from the MySQL team, MySQL product updates and insightful interviews with members of the MySQL community.

00:25.000 --> 00:32.000 Sit back and enjoy as your hosts bring you the latest updates on your favorite open source database. Let's get started.

00:32.000 --> 00:37.000 Hello and welcome to Sakila Speaks, the podcast dedicated to MySQL. I'm LeFred.

00:37.000 --> 00:38.000 And I'm Scott Stroz.

00:38.000 --> 00:47.000 Joining us today is Alkin Tezuysal. We have known each other for a long time already, and Alkin serves as Director of Services at Altinity Inc.

00:47.000 --> 00:55.000 He brings over 30 years of experience in open source relational databases, with deep expertise in MySQL, of course, and ClickHouse.

00:55.000 --> 01:08.000 He co-authored key reference works, including MySQL Cookbook, 4th Edition, which came out in 2022, and Database Design and Modeling with Postgres and MySQL in 2024.

01:08.000 --> 01:21.000 Alkin, you have been honored as MySQL Rockstar in 2023. And since this year, you are also an Oracle Ace Pro for MySQL. Congratulations and welcome to Inside MySQL: Sakila Speaks.

01:21.000 --> 01:23.000 Thank you very much, everyone.

01:23.000 --> 01:34.000 We're glad you're here. Alkin, as you may not know, this season of the podcast is dedicated to all things AI as it relates to MySQL and HeatWave.

01:34.000 --> 01:43.000 And you actually created or wrote a plugin for MySQL Community that kind of helped with that, MyVector.

01:43.000 --> 01:48.000 Can you give us an overview of what MyVector is and what problem it's meant to solve?

01:48.000 --> 01:50.000 Sure. Thank you very much for the question.

01:50.000 --> 02:00.000 And I'm very happy that this year of AI and HeatWave, everything that actually contributes to this technology because it's fairly new.

02:00.000 --> 02:06.000 It's been developing for many years, as we already know, but now it's in our hands.

02:06.000 --> 02:16.000 We can use it. We can definitely use it on our day-to-day activities, whether it's troubleshooting your dishwasher or your washing machine.

02:16.000 --> 02:20.000 But we could also use it in a business-wise database.

02:20.000 --> 02:29.000 So one correction I want to make is that I am a contributor to the MyVector plugin, not the author.

02:29.000 --> 02:34.000 The author is Shankar Iyer, and he has been a database developer for many years.

02:34.000 --> 02:40.000 He's got a lot of experience, whereas I've actually been presenting and supporting this project.

02:40.000 --> 02:49.000 And that's the small correction. Other than that, MyVector is a native plugin for MySQL that adds support for storing and searching high-dimensional vectors.

02:49.000 --> 02:55.000 That is basically, in very simple terms, what it does.

02:55.000 --> 03:00.000 And this has been in development for some time.

03:00.000 --> 03:14.000 And as we have seen, other open source databases also went into this with the launch of AI to our end users.

03:14.000 --> 03:24.000 Adding approximate nearest neighbor (ANN) search directly in SQL within the MySQL database was kind of needed.

03:24.000 --> 03:29.000 And there have been similar implementations with MySQL.

03:29.000 --> 03:33.000 But MyVector is the open source version of that, as a plugin.

03:33.000 --> 03:39.000 So, just to wrap up that answer: there is a MyVector column type for embedding storage.

03:39.000 --> 03:41.000 And there's a MyVector...

03:41.000 --> 03:46.000 There's a bunch of functions, such as MyVector distance for the similarity computation.

03:46.000 --> 03:50.000 Of course, it uses an HNSW-based index algorithm, which is very popular.

03:50.000 --> 03:52.000 There's a white paper around it.

03:52.000 --> 04:01.000 It's not rocket science or something that was invented just for MyVector; it is known science.

04:01.000 --> 04:06.000 And basically, it provides an SQL native interface within MySQL.

04:06.000 --> 04:08.000 Hope that answers that question.
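
The overview above mentions distance functions for similarity computation. As a minimal sketch of what such a function computes, here are two common vector distances in plain Python. The function names are illustrative only and are not MyVector's SQL interface, which the episode does not spell out.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
print(l2_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A similarity search ranks stored vectors by one of these distances to the query vector; an index such as HNSW exists to avoid computing the distance against every stored row.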

04:08.000 --> 04:10.000 Thank you very much, Alkin, yeah.

04:10.000 --> 04:22.000 It answers everything, and I'm very happy that you also, let's say, talked about the author, whom we also met in Belgium recently.

04:22.000 --> 04:31.000 So I would like to ask you: why is it important to have these similarity search indexes in MySQL, then?

04:31.000 --> 04:40.000 Yeah. So again, going back to AI-driven applications: semantic search, product recommendation, question answering, anomaly detection, etc.

04:40.000 --> 04:43.000 These really require similarity searches.

04:43.000 --> 04:47.000 Have we done similarity searches in the past? Yes, we have.

04:47.000 --> 04:52.000 If you remember, this is a long, long time ago, but those technologies are still in effect.

04:52.000 --> 05:03.000 And we had search indexes like Solr and Sphinx, if you recall those, where we used to have a replica, generate an index, and search against it.

05:03.000 --> 05:10.000 I used to work for an e-commerce site and users would search for a product.

05:10.000 --> 05:15.000 And then we would also display the similar products.

05:15.000 --> 05:23.000 And in order to do that in MySQL, we had to use external services like, as I said, Sphinx search.

05:23.000 --> 05:25.000 So it is very important.

05:25.000 --> 05:29.000 But with the AI-driven application, it's not important anymore.

05:29.000 --> 05:30.000 It's a must have.

05:30.000 --> 05:35.000 Basically, you don't need to run a separate vector database.

05:35.000 --> 05:45.000 And basically, if the data is already in MySQL, you could use this technology using, you know, similarity search functionalities.

05:45.000 --> 05:49.000 Back at FOSDEM, you gave a presentation about MyVector.

05:49.000 --> 05:55.000 And over the weekend at FOSDEM, there were a lot of other sessions about vector and indexes.

05:55.000 --> 06:01.000 Has MyVector made any significant changes since you last talked about it in public?

06:01.000 --> 06:07.000 Yes, there was another public talk after FOSDEM that was a vector search conference.

06:07.000 --> 06:14.000 And we've had a bunch of talks about vector search and vector technologies, which were around these open source databases, including MySQL.

06:14.000 --> 06:19.000 There were, I think, four or five MySQL talks around the vector search.

06:19.000 --> 06:33.000 From the development side, yes, there's one important improvement that was made that was the necessary support for binary distributions other than the Docker images.

06:33.000 --> 06:43.000 So we worked on those and built, you know, three different versions of MySQL binary distributions for testing, because it's more like a DIY.

06:43.000 --> 06:51.000 You have to compile it, and not everyone is competent enough or has enough time to compile MySQL.

06:51.000 --> 07:02.000 So we built images for the 8.0, 8.4, and 9.x versions for easy testing.

07:02.000 --> 07:12.000 And there were some improvements on performance and index stability, of course, and so that's about it.

07:12.000 --> 07:18.000 Maybe it doesn't sound a lot, but this is a lot of work, basically, considering it's an open source project.

07:18.000 --> 07:21.000 Yeah, thank you. I can imagine it's a lot of work.

07:21.000 --> 07:31.000 So let's now get more technical and dig a bit deeper there.

07:31.000 --> 07:41.000 You said earlier that MyVector is using this HNSW, which stands for hierarchical navigable small world indexes, right?

07:41.000 --> 07:48.000 Why was this type chosen over the alternatives?

07:48.000 --> 07:55.000 And do you know whether alternatives were tried, or have you tried them yourself?

07:55.000 --> 07:59.000 We would like to know a bit more about why that choice.

07:59.000 --> 08:01.000 That's a great question, actually.

08:01.000 --> 08:11.000 When we all first heard about this HNSW, hierarchical navigable small world, for ANN search, meaning approximate nearest neighbor search,

08:11.000 --> 08:21.000 it sounded like... When I did my research and started reading about it, I think we met with you in London last year.

08:21.000 --> 08:26.000 We were talking about this, you know, ANN search and everything else.

08:26.000 --> 08:33.000 Basically, I thought it was more like a de facto standard for ANN search.

08:33.000 --> 08:44.000 And it turned out to be that way because a lot of the other open source databases or implementations were circling around HNSW.

08:44.000 --> 08:49.000 And that's not to say that there are not other options out there.

08:49.000 --> 09:00.000 But usually when technologies like this are launched, you don't go and reinvent the wheel; you basically build upon an existing technology.

09:00.000 --> 09:09.000 Since HNSW was widely available knowledge-wise, HNSW was chosen.

09:09.000 --> 09:13.000 And, you know, it has high accuracy.

09:13.000 --> 09:16.000 It's got support for dynamic inserts and deletes.

09:16.000 --> 09:19.000 And it has efficient memory usage.

09:19.000 --> 09:21.000 These are the top three things that I know about it.

09:21.000 --> 09:31.000 But, you know, from the other open source databases, like I said, the benchmarks were all circling around this.

09:31.000 --> 09:41.000 And if you were to use a different index type, it would be very difficult to compare apples to apples.

09:41.000 --> 09:52.000 So, again, I'm not saying there are no other ANN methods. There are, but they might be less accurate.

09:52.000 --> 09:54.000 They may have different options.

09:54.000 --> 10:04.000 But if you want to put something in the market that everybody knows, you would be better off using the known methodologies.
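
To give a feel for the "navigable small world" idea behind HNSW, here is a deliberately simplified, single-layer sketch in Python. It is not MyVector's implementation: real HNSW stacks several such layers and keeps a candidate list during search, and the class name `TinyNSW` and its parameter `m` are hypothetical names chosen for this illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class TinyNSW:
    """Single-layer navigable small world graph (illustration only)."""

    def __init__(self, m=2):
        self.m = m          # number of links created for each inserted node
        self.vectors = []   # node id -> vector
        self.links = []     # node id -> set of neighbor ids

    def insert(self, vec):
        new_id = len(self.vectors)
        # Pick the m closest existing nodes by brute force.
        # (Real HNSW finds them with a graph search instead.)
        nearest = sorted(range(new_id),
                         key=lambda i: dist(self.vectors[i], vec))[: self.m]
        self.vectors.append(vec)
        self.links.append(set(nearest))
        for i in nearest:
            self.links[i].add(new_id)  # links are bidirectional
        return new_id

    def search(self, query, entry=0):
        """Greedy walk: hop to the closest neighbor until none is closer."""
        if not self.vectors:
            raise ValueError("index is empty")
        current = entry
        current_d = dist(self.vectors[current], query)
        while True:
            best, best_d = current, current_d
            for n in self.links[current]:
                d = dist(self.vectors[n], query)
                if d < best_d:
                    best, best_d = n, d
            if best == current:
                return current, current_d
            current, current_d = best, best_d


index = TinyNSW(m=2)
for x in range(10):
    index.insert([float(x)])  # ten points on a line: 0.0, 1.0, ..., 9.0
node_id, distance = index.search([7.2])
print(node_id, distance)
```

The greedy walk only inspects the neighbors of the nodes it visits, which is why the search is fast, and it can stop at a local optimum, which is why the result is approximate. Inserts simply add a node and a few links, which matches the point above about dynamic inserts.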

10:04.000 --> 10:11.000 So you've given me something I need to look up, so I know what I'm going to be doing over the weekend, which is HNSW.

10:11.000 --> 10:18.000 So does the data that we use need to be trained before it can be indexed?

10:18.000 --> 10:20.000 No, there's no training.

10:20.000 --> 10:23.000 Basically, it's the embeddings.

10:23.000 --> 10:30.000 The difference between training and embeddings is that you just need to generate the embeddings.

10:30.000 --> 10:55.000 And that's where there is an additional step: if your data is already in the database and you want to use this vector search technology, using HNSW indexing for ANN search, you need to generate the embeddings, whether externally or internally, with a service or something like that.

10:55.000 --> 11:18.000 I know that there are some other types of indexes, maybe less popular, that index the embeddings and sometimes also need some training beforehand. But this one doesn't, which is good, because every time you want to add data, it's quite complicated if you have to train it.

11:18.000 --> 11:19.000 Right.

11:19.000 --> 11:20.000 Yeah.

11:20.000 --> 11:25.000 Basically, you generate the embeddings, insert them into MySQL, and build the HNSW index.
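
The three steps just described (generate embeddings, store them next to the rows, then search) can be sketched in plain Python. Everything here is a stand-in: `toy_embed` is a hypothetical hashed bag-of-words function rather than a real embedding model, a Python list plays the role of the MySQL table, and an exact scan takes the place of the HNSW index.

```python
import hashlib
import math

DIM = 16  # toy dimension; real embedding models use far more


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for an embedding model (hashed bag of words)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Steps 1 and 2: generate an embedding per row and store it with the row.
rows = ["red running shoes", "blue running shoes", "stainless steel kettle"]
table = [(row_id, text, toy_embed(text)) for row_id, text in enumerate(rows)]


# Step 3: search. This exact scan is what an ANN index would replace.
def most_similar(query_text, k=2):
    q = toy_embed(query_text)
    scored = [(sum(a * b for a, b in zip(q, emb)), text)
              for _, text, emb in table]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


print(most_similar("red running shoes", k=1))
```

Note that no training step appears anywhere: new rows only need an embedding before they can be added, which is the property highlighted in the answer above.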

11:25.000 --> 11:38.420 So, as we are discussing this index, I'm curious, because I also try and check different types of these indexes to understand what they do and what they are.

11:38.420 --> 11:46.160 I would like to know what the size of this index is compared to the actual size of the data, right?

11:46.440 --> 11:59.440 Because I know, and maybe it's the case in your implementation, that the full representation of the vector is stored in the index for some of them, or most of them.

11:59.440 --> 12:09.620 So I would like to know if you have made some checks there, and how the size of the index compares to the full embeddings.

12:09.920 --> 12:16.940 Just to recap that: the full vector is stored inside the index structure for fast access.

12:17.540 --> 12:22.560 So there's no reference back or anything like that.

12:22.560 --> 12:46.380 As for the size of the index, that depends on the vector dimensions, the number of vectors that we're storing, and some of the per-node parameters of that index. And yes, we did some sizing and testing around it.

12:46.380 --> 12:54.160 The accuracy increases when the dimensions are high, as we know, and the size gets higher.

12:54.440 --> 13:04.800 So we're also looking into whether there is any option to optimize that, or to use some compression technology for this index.

13:05.160 --> 13:16.360 And that's kind of important to know, because this is not in InnoDB; this is basically in the file system.

13:16.380 --> 13:19.080 And it needs to be, you know, placed correctly.
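
The sizing factors named above (vector dimensions, number of vectors, per-node parameters) can be turned into a back-of-the-envelope estimate. The formula below is an assumption for illustration, not a measurement of MyVector: it assumes 4-byte floats, the full vector stored in the index, and roughly `2 * m` base-layer links of 4 bytes each per node, and it ignores upper layers and metadata. The function name and its parameters are hypothetical.

```python
def estimate_index_bytes(num_vectors, dim, m=16,
                         bytes_per_float=4, bytes_per_link=4):
    """Rough lower-bound estimate of an HNSW-style index size in bytes."""
    vector_bytes = dim * bytes_per_float   # full vector kept in the index
    link_bytes = 2 * m * bytes_per_link    # assumed base-layer fan-out of 2*m
    return num_vectors * (vector_bytes + link_bytes)


# One million 768-dimensional vectors with m=16:
# 768 * 4 = 3072 bytes of vector data plus 2 * 16 * 4 = 128 bytes of links.
print(estimate_index_bytes(1_000_000, 768, m=16))  # 3200000000
```

Under these assumptions the vector data dominates, so the index is at least as large as the raw embeddings, and doubling the dimension roughly doubles the index. That is consistent with the interest in compression mentioned above.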

13:19.360 --> 13:32.200 You mentioned before how MyVector is available, and I know it's available as a plugin. Just to clarify for our listeners, do you also provide binary packages, or do we still need to compile it from source?

13:32.980 --> 13:33.480 Yes.

13:33.480 --> 13:46.860 As I mentioned earlier, since FOSDEM the latest release of MyVector has included binary releases for x86.

13:47.480 --> 13:53.840 And that is published on the GitHub page as open source.

13:54.100 --> 13:57.220 So you no longer need to compile it from source

13:57.220 --> 14:02.360 if you want to test it out. Just to plug my, um...

14:02.360 --> 14:16.980 I have launched a new technical blog page and started blogging, and I will blog about this so that the listeners and the readers can have a link available in the blog.

14:16.980 --> 14:28.440 So they can go in and test it out. Docker images were already available, but binary releases were also added recently for testing.

14:28.440 --> 14:30.380 Awesome. Thank you, Alkin.

14:30.720 --> 14:42.720 So, last question: do you know if there are already some companies or users using MyVector in production, or not yet?

14:43.260 --> 14:45.500 There are some POCs going on.

14:45.860 --> 14:47.700 And as you know, this is open source.

14:47.700 --> 14:58.360 There was some interest after FOSDEM. People reached out who were actually doing this type of research and analysis.

14:58.360 --> 15:03.920 It was on existing MySQL databases, and we provided help and information.

15:04.380 --> 15:05.540 They might have taken it.

15:05.600 --> 15:06.580 They might have forked it.

15:06.640 --> 15:08.980 They might have embedded it into their existing implementation.

15:09.580 --> 15:16.500 We don't know, but as far as I know, there are a few POCs going on; they're testing with existing data.

15:16.500 --> 15:29.520 So, just to add: since FOSDEM, there has been one shift in this type of technology, which is the MCP servers.

15:29.520 --> 15:45.640 So that is one thing that I wanted to add here. That shift, generating embeddings and actually having an MCP server that adds context to the ANN search,

15:45.640 --> 15:53.620 like in a chatbot implementation, made things not only a little more useful, but also more interesting.

15:54.320 --> 16:01.380 So say you have support tickets, or some data that's related to your internal customers.

16:01.580 --> 16:03.600 You could add an MCP server.

16:03.800 --> 16:06.160 There are some public MCP servers.

16:06.320 --> 16:09.780 There are some open source MCP server implementations.

16:10.080 --> 16:12.300 And we're also looking into that.

16:12.560 --> 16:15.560 And I want to mention here that in my last

16:15.640 --> 16:24.200 talk, during the vector search conference, I actually presented a MyVector with MCP server demo.

16:24.200 --> 16:27.880 And that recording should be available.

16:27.880 --> 16:30.900 And it actually works.

16:30.900 --> 16:44.520 This is pure open source MySQL, pure open source MyVector, and a pure open source MCP server, with clinical trials data, one of the public datasets that I've used.

16:44.520 --> 16:52.300 You can actually ask questions and it'll answer, and then you can continue the chat using this very technology.
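
As a rough sketch of how a similarity search can add context to a chat, the snippet below retrieves the closest support tickets for a question and formats them into a prompt. It does not use any MCP library and is not the demo from the talk: the ticket data, the hand-written 3-dimensional embeddings, and the names `search_tickets` and `build_prompt` are all hypothetical. In an MCP setup, a function like `search_tickets` is the kind of tool a server could expose to the model.

```python
# Hypothetical tickets with hand-written toy embeddings (id, title, vector).
TICKETS = [
    ("T-1", "Replica lag after upgrade", [0.9, 0.1, 0.0]),
    ("T-2", "Login page returns 500", [0.0, 0.2, 0.9]),
    ("T-3", "Replication stops with duplicate key", [0.8, 0.3, 0.1]),
]


def dot(a, b):
    """Inner product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search_tickets(query_vec, k=2):
    """Return the k tickets most similar to the query vector."""
    ranked = sorted(TICKETS, key=lambda t: dot(query_vec, t[2]), reverse=True)
    return ranked[:k]


def build_prompt(question, query_vec, k=2):
    """Put the retrieved tickets in front of the question as context."""
    hits = search_tickets(query_vec, k)
    context = "\n".join(f"[{tid}] {title}" for tid, title, _ in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


# The query vector would normally come from the same embedding model
# that produced the ticket embeddings; here it is written by hand.
print(build_prompt("Why is my replica behind?", [1.0, 0.0, 0.0]))
```

Because both the data and the search stay in one place, the model only ever sees the few rows that were retrieved for the question.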

16:52.300 --> 16:54.060 That's awesome.

16:54.060 --> 17:00.060 I've actually been playing around with writing my own MCP servers and having them interact with MySQL.

17:00.060 --> 17:14.880 And I think MCP is going to wind up being pretty big, because it gives that domain-specific context that the LLMs can actually use to generate content or answers or whatever.

17:14.880 --> 17:18.120 So I'm interested to play around with that.

17:19.000 --> 17:19.520 Absolutely.

17:19.740 --> 17:20.120 Absolutely.

17:20.120 --> 17:23.100 This is a very interesting development, and very recent.

17:23.740 --> 17:28.440 A lot of people are experimenting right now with MCP servers.

17:29.160 --> 17:44.300 MCP servers are going to be, I think, the salt and pepper, or the sauce, of this technology: having the vectors, the embeddings, the ANN search, the HNSW index, and the MCP server.

17:44.300 --> 17:47.220 So it's going to complete the puzzle, in my opinion.

17:47.600 --> 17:50.920 And also, like I said, I've already given a demo.

17:51.060 --> 17:53.840 We are also looking into this technology.

17:54.120 --> 17:59.220 And of course, that is also still under development.

17:59.740 --> 18:08.980 If you're opening up to a public MCP, then your security and compliance is now outside again.

18:08.980 --> 18:16.200 You may want to do everything internally, in your own database, with your own security and compliance.

18:16.700 --> 18:20.660 So it's a game changer to have something available for yourself.

18:21.260 --> 18:21.660 Excellent.

18:21.840 --> 18:23.140 So thank you very much, Alkin.

18:23.660 --> 18:25.060 Thank you for your time.

18:25.460 --> 18:27.740 We know you are on your boat right now, sailing.

18:27.960 --> 18:28.900 So that's awesome.

18:29.180 --> 18:29.580 Thank you.

18:29.640 --> 18:30.420 Thanks a lot, guys.

18:30.600 --> 18:31.120 Thank you, Alkin.

18:31.180 --> 18:33.120 And thank you for all your contributions to the community.

18:33.120 --> 18:36.880 That's a wrap on this episode of Inside MySQL: Sakila Speaks.

18:37.040 --> 18:38.280 Thanks for hanging out with us.

18:38.560 --> 18:42.320 If you enjoyed listening, please click subscribe to get all the latest episodes.

18:42.600 --> 18:45.540 We would also love your reviews and ratings on your podcast app.

18:45.860 --> 18:50.080 Be sure to join us for the next episode of Inside MySQL: Sakila Speaks.

  continue reading

15 Episoden

Artwork
iconTeilen
 
Manage episode 507068287 series 3568157
Inhalt bereitgestellt von Oracle Corporation. Alle Podcast-Inhalte, einschließlich Episoden, Grafiken und Podcast-Beschreibungen, werden direkt von Oracle Corporation oder seinem Podcast-Plattformpartner hochgeladen und bereitgestellt. Wenn Sie glauben, dass jemand Ihr urheberrechtlich geschütztes Werk ohne Ihre Erlaubnis nutzt, können Sie dem hier beschriebenen Verfahren folgen https://de.player.fm/legal.

Oracle Ace Alkin Tezuysal joins leFred and Scott to introduce the MyVector plugin for MySQL Community Edition, bringing powerful vector search capabilities to your favorite open-source database. Learn how MyVector enables advanced AI and similarity search features, why this matters for modern applications, and how the MySQL community can easily get started.

-------------------------------------------------------------

Episode Transcript:

00:00.000 --> 00:25.000 Welcome to Inside MySQL: Sakila Speaks, a podcast dedicated to all things MySQL. We bring you the latest news from the MySQL team, MySQL product updates and insightful interviews with members of the MySQL community.

00:25.000 --> 00:32.000 Sit back and enjoy as your hosts bring you the latest updates on your favorite open source database. Let's get started.

00:32.000 --> 00:37.000 Hello and welcome to Sakila Speaks, the podcast dedicated to MySQL. I'm LeFred.

00:37.000 --> 00:38.000 And I'm Scott Stroz.

00:38.000 --> 00:47.000 Joining us today is Alkin Tezuysal. We know each other for a long time already and Alkin serves as Director of Services at Altinity Inc.

00:47.000 --> 00:55.000 Bringing over 30 years of experience in open source relational databases with deep expertise in MySQL, of course, and ClickHouse.

00:55.000 --> 01:08.000 He co-authored key references works including MySQL Cookbook 4th edition that came in 2022 and Database Design and Modeling with Postgres and MySQL in 2024.

01:08.000 --> 01:21.000 Alkin, you have been honored as MySQL Rockstar in 2023. And since this year, you are also an Oracle Ace Pro for MySQL. Congratulations and welcome to Inside MySQL: Sakila Speaks.

01:21.000 --> 01:23.000 Thank you very much, everyone.

01:23.000 --> 01:34.000 We're glad you're here. Alkin, as you may not know, this season of the podcast is dedicated to all things AI as it relates to MySQL and HeatWave.

01:34.000 --> 01:43.000 And you actually created or wrote a plugin for MySQL Community that kind of helped with that, MyVector.

01:43.000 --> 01:48.000 Can you give us an overview of what MyVector is and what problem it's meant to solve?

01:48.000 --> 01:50.000 Sure. Thank you very much for the question.

01:50.000 --> 02:00.000 And I'm very happy that this year of AI and HeatWave, everything that actually contributes to this technology because it's fairly new.

02:00.000 --> 02:06.000 It's been developing for many years, as we already know, but now it's in our hands.

02:06.000 --> 02:16.000 We can use it. We can definitely use it on our day-to-day activities, whether it's troubleshooting your dishwasher or your washing machine.

02:16.000 --> 02:20.000 But we could also use it in a business-wise database.

02:20.000 --> 02:29.000 So one correction I want to make is I am a contributor to MyVector plugin, not to author.

02:29.000 --> 02:34.000 The author is Shankar Iyer, and he's a developer for databases for many years.

02:34.000 --> 02:40.000 He's got a lot of experience where I've actually been presenting and supporting this project.

02:40.000 --> 02:49.000 And that's the small correction. Other than that, MyVector is a native plugin for MySQL that adds support for storing and searching high dimensional vectors.

02:49.000 --> 02:55.000 This is basically a very, in simple terms, what it does.

02:55.000 --> 03:00.000 And this has been in development for some time.

03:00.000 --> 03:14.000 And as we have seen other, you know, databases, other open source databases also went into this with the, you know, launching of AI to our, you know, end users.

03:14.000 --> 03:24.000 Adding approximate nearest neighbor n-search directly in SQL within MySQL database was kind of needed.

03:24.000 --> 03:29.000 And there has been similar implementations with MySQL.

03:29.000 --> 03:33.000 But MyVector is the open source version of that as a plugin.

03:33.000 --> 03:39.000 So just to wrap up that answer is MyVector column type for embedding storage.

03:39.000 --> 03:41.000 And there's a MyVector.

03:41.000 --> 03:46.000 There's a bunch of functions that MyVector distance for the similarity competition.

03:46.000 --> 03:50.000 Of course, it uses HNSW-based index algorithm, which is very popular.

03:50.000 --> 03:52.000 There's a white paper around it.

03:52.000 --> 04:01.000 It's not a rocket science or just something that was invented for MyVector that is known science.

04:01.000 --> 04:06.000 And basically, it provides an SQL native interface within MySQL.

04:06.000 --> 04:08.000 Hope that answers that question.

04:08.000 --> 04:10.000 Thank you very much, Alkin, yeah.

04:10.000 --> 04:22.000 It answers everything and very happy that you also, let's say, talk about the author that we already met also in Belgium recently.

04:22.000 --> 04:31.000 So I would like to ask you, so why is it important to have this similarity search indexes in MySQL then?

04:31.000 --> 04:40.000 Yeah. So again, going back to the AI-driven application, semantic search, product recommendation, question and answering, anomaly detection, etc.

04:40.000 --> 04:43.000 These really require a similarity searches.

04:43.000 --> 04:47.000 Have we done similarity searches in the past? Yes, we have.

04:47.000 --> 04:52.000 If you remember, this is a long, long time ago, but those technologies are still in effect.

04:52.000 --> 05:03.000 And we had search indexes like the Solr, this Phoenix, if you recall those, where we used to have a replica, generate index and search for it.

05:03.000 --> 05:10.000 I used to work for an e-commerce site and users would search for a product.

05:10.000 --> 05:15.000 And then we would also display the similar products.

05:15.000 --> 05:23.000 And in order to do that in MySQL, we had to use external services like, like I said, some search.

05:23.000 --> 05:25.000 So it is very important.

05:25.000 --> 05:29.000 But with the AI-driven application, it's not important anymore.

05:29.000 --> 05:30.000 It's a must have.

05:30.000 --> 05:35.000 Basically, you don't need to run a separate vector database.

05:35.000 --> 05:45.000 And basically, if the data is already in MySQL, you could use this technology using, you know, similarity search functionalities.

05:45.000 --> 05:49.000 Back at FOSDEM, you gave a presentation about MyVector.

05:49.000 --> 05:55.000 And over the weekend at FOSDEM, there were a lot of other sessions about vector and indexes.

05:55.000 --> 06:01.000 Has MyVector made any significant changes since you last talked about it in public?

06:01.000 --> 06:07.000 Yes, there was another public talk after FOSDEM that was a vector search conference.

06:07.000 --> 06:14.000 And we've had a bunch of talks about vector searches, vector technologies, which was around this open source databases, including MySQL.

06:14.000 --> 06:19.000 There were, I think, four or five MySQL talks around the vector search.

06:19.000 --> 06:33.000 From the development side, yes, there's one important improvement that was made that was the necessary support for binary distributions other than the Docker images.

06:33.000 --> 06:43.000 So we worked on those and built, you know, three different versions of MySQL binary distributions for testing, because it's more like a DIY.

06:43.000 --> 06:51.000 And you have to compile and everyone is not very competent enough or have enough time to compile MySQL.

06:51.000 --> 07:02.000 So we built images for 8.0 and 8.4 and 9x versions for easy testing.

07:02.000 --> 07:12.000 And there were some improvements on performance and index stability, of course, and so that's about it.

07:12.000 --> 07:18.000 Maybe it doesn't sound a lot, but this is a lot of work, basically, considering it's an open source project.

07:18.000 --> 07:21.000 Yeah, thank you. I can imagine it's a lot of work.

07:21.000 --> 07:31.000 So let's go now in the more technical, let's dig a bit in technical and a bit deeper there.

07:31.000 --> 07:41.000 So you said earlier that MyVector is using this HNSW, which is a hierarchical navigable small world indexes, right?

07:41.000 --> 07:48.000 Why was this type chosen over other or over alternatives?

07:48.000 --> 07:55.000 And do you know if or you yourself have tried alternatives or not?

07:55.000 --> 07:59.000 We would like to know a bit more about why that choice.

07:59.000 --> 08:01.000 That's a great question, actually.

08:01.000 --> 08:11.000 And when we first all heard or started knowing about this HNSW, hierarchical navigable small word for the n-search, like approximate nearest neighbor search.

08:11.000 --> 08:21.000 That was, it sounded like when I did my research and started reading about it, I think we met with you in London last year.

08:21.000 --> 08:26.000 We were talking about this, you know, the n-search and everything else.

08:26.000 --> 08:33.000 This is basically, I thought it was more like a de facto standard of the n-search.

08:33.000 --> 08:44.000 And it turned out to be that way because a lot of the other open source databases or implementations were circling around HNSW.

08:44.000 --> 08:49.000 And that's not to say that there are not other options out there.

08:49.000 --> 09:00.000 But usually when technologies like this launched, you don't go and reinvent the wheel, but basically build upon an existing technology.

09:00.000 --> 09:09.000 Since HNSW was widely available in terms of a knowledge wise, it was chosen HNSW.

09:09.000 --> 09:13.000 And, you know, it has high accuracy.

09:13.000 --> 09:16.000 It's a, it's got support for dynamic inserts and leads.

09:16.000 --> 09:19.000 And, and it has an efficient memory usage.

09:19.000 --> 09:21.000 These are the top three things that I know about it.

09:21.000 --> 09:31.000 But, you know, you know, from the other open source databases, like I said, the benchmarking were all circling around this.

09:31.000 --> 09:41.000 And if you were to use a different indexing, it would be very difficult to compare apple to apple from a different indexing perspective.

09:41.000 --> 09:52.000 So, I think, again, I'm, I'm not saying there are no other and methods there are, but they might be less accurate.

09:52.000 --> 09:54.000 They may have different, options.

09:54.000 --> 10:04.000 but if you want to kind of, play something in the market that everybody knows, it would be better off, using the known, methodologies.

10:04.000 --> 10:11.000 So you've given me something I need to look up so I know what I'm going to be doing over the weekend, which HNSW.

10:11.000 --> 10:18.000 so does the data that we use need to be trained before it can be indexed?

10:18.000 --> 10:20.000 No, there's no training.

10:20.000 --> 10:23.000 Basically it's the embeddings.

10:23.000 --> 10:30.000 The difference between the training and the embeddings is you just need to generate the embeddings.

10:30.000 --> 10:55.000 And that's where there's an additional step. If your data is already in the database and you want to use this vector search technology, using HNSW indexing for the ANN search, you need to generate the embeddings, whether externally or internally, with a service or something like that.

10:55.000 --> 11:18.000 I know that there are some other types of index, maybe less popular, that index the embeddings and sometimes also need some training beforehand. But, yeah, this one doesn't, which is good, because every time you want to add data, it's quite complicated if you have to train it.

11:18.000 --> 11:19.000 Right.

11:19.000 --> 11:20.000 Yeah.

11:20.000 --> 11:25.000 Basically, you generate the embeddings, insert them into MySQL, and build the HNSW index.
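
[Editor's sketch] The three steps just described (generate embeddings, insert them, search) can be illustrated with a small, self-contained Python example. It is only a sketch under loud assumptions: `toy_embedding` is a hypothetical stand-in for a real embedding model, a plain Python list stands in for a MySQL table, and the search is an exact brute-force scan rather than an HNSW index. It does not use or represent the MyVector API.

```python
import math

documents = {
    1: "replication lag on the replica server",
    2: "how to bake sourdough bread",
    3: "replica server falls behind the primary",
}

# Assumption: a fixed vocabulary built from the corpus replaces a real model.
VOCAB = sorted({word for text in documents.values() for word in text.lower().split()})


def toy_embedding(text):
    """Hypothetical embedding: normalized word counts over VOCAB."""
    words = text.lower().split()
    vec = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


def exact_search(query, rows, k=2):
    """Brute-force top-k; an ANN index approximates this without a full scan."""
    q = toy_embedding(query)
    ranked = sorted(rows, key=lambda r: cosine_similarity(q, r["embedding"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


# Steps 1 and 2: generate the embeddings and "insert" them (list stands in for a table).
rows = [{"id": doc_id, "embedding": toy_embedding(text)} for doc_id, text in documents.items()]

# Step 3 would be building the index; here every row is simply scanned.
top = exact_search("replica server lag", rows, k=2)
print(top)
```

The brute-force scan is the exact answer that an approximate index such as HNSW trades a little accuracy against in order to avoid comparing the query with every stored vector.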

11:25.000 --> 11:38.420 So, as you are discussing this index, I'm curious, because I also try to check different types of vector indexes to understand what they do and what they are.

11:38.420 --> 11:46.160 But I would like to know what the size of this index is compared to the actual size of the data, right?

11:46.440 --> 11:59.440 Because I know, and maybe it's the case in your implementation, that the full representation of the vector is stored in the index in some of them, or most of them.

11:59.440 --> 12:09.620 So I would like to know if you have made some checks there, and how the size of the index compares to the full embeddings.

12:09.920 --> 12:16.940 Just to recap that, the full vector is stored inside the index structure for fast access.

12:17.540 --> 12:22.560 So there's no reference back or anything like that.

12:22.560 --> 12:46.380 As we were talking about, the size of the index depends on the vector dimensions, the number of vectors that we're storing, and some of the parameters, you know, per node, of that index. But we did some sizing and testing around it, yes.

12:46.380 --> 12:54.160 The accuracy increases when the dimensions are high, as we know, and the size gets higher.
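
[Editor's sketch] To make the sizing factors concrete, here is a back-of-the-envelope estimator. It assumes a commonly quoted rule of thumb for HNSW with 4-byte floats, roughly `dim * 4 + M * 2 * 4` bytes per vector, where `M` is the maximum number of links per node. The function name and defaults are hypothetical, the formula ignores upper graph layers and file overhead, and the numbers are not measurements of MyVector.

```python
def estimate_hnsw_bytes(num_vectors, dim, m=16, bytes_per_float=4, bytes_per_link=4):
    """Rough HNSW size: the full vector plus base-layer links for every node."""
    vector_bytes = dim * bytes_per_float
    # Assumption: the base layer keeps up to 2 * M neighbor links per node.
    link_bytes = 2 * m * bytes_per_link
    return num_vectors * (vector_bytes + link_bytes)


# Show how the estimate grows with the embedding dimension.
for dim in (384, 768, 1536):
    size = estimate_hnsw_bytes(1_000_000, dim)
    print(f"dim={dim}: {size / 1024**3:.2f} GiB")
```

Under this estimate the stored vectors dominate the size, which is consistent with the point that higher dimensions make the index larger.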

12:54.440 --> 13:04.800 So we're also looking into whether there is any option to optimize that or use some compression technology for this index.

13:05.160 --> 13:16.360 And that's something that is kind of important to know, because this is not in InnoDB. This is basically in the file system.

13:16.380 --> 13:19.080 And it needs to be, you know, placed correctly.

13:19.360 --> 13:32.200 You mentioned before how MyVector is available, and I know it's available as a plugin. Just to clarify for our listeners, do you also provide binary packages, or do we still need to compile it from source?

13:32.980 --> 13:33.480 Yes.

13:33.480 --> 13:46.860 As I mentioned earlier, since FOSDEM, the latest release of MyVector has included binary releases for x86.

13:47.480 --> 13:53.840 And that is published on the GitHub page as open source.

13:54.100 --> 13:57.220 So you no longer need to compile it from source.

13:57.220 --> 14:02.360 If you want to test it out, and just to plug my, um,

14:02.360 --> 14:16.980 I have launched a new technical blog page and started blogging, and I will blog about this so that the listeners and the readers can have a link available in the blog.

14:16.980 --> 14:28.440 So they can go in and test it out. Docker images were already available, but binary releases were also added recently for testing.

14:28.440 --> 14:30.380 Awesome. Thank you, Alkin.

14:30.720 --> 14:42.720 So last question: do you know if there are already some companies or users using MyVector in production, or not yet?

14:43.260 --> 14:45.500 There are some POCs going on.

14:45.860 --> 14:47.700 And as you know, this is open source.

14:47.700 --> 14:58.360 There was some interest after FOSDEM. You know, people reached out who were actually doing this type of research and analysis.

14:58.360 --> 15:03.920 It was on existing MySQL databases, and we provided help and information.

15:04.380 --> 15:05.540 They might have taken it.

15:05.600 --> 15:06.580 They might have forked it.

15:06.640 --> 15:08.980 They might have embedded it into their existing implementation.

15:09.580 --> 15:16.500 We don't know, but as far as I know, there are a few POCs going on, and they're testing with existing data.

15:16.500 --> 15:29.520 So, as you know, just to add, since FOSDEM there has been one shift in this type of technology, which is the MCP servers.

15:29.520 --> 15:45.640 So that is one thing that I wanted to add here. With that shift, generating embeddings and actually having an MCP server that will add context to the ANN search,

15:45.640 --> 15:53.620 like a chat bot implementation, made things not only more useful, but also more interesting.

15:54.320 --> 16:01.380 So say you have, you know, support tickets or some data that's actually related to your, you know, internal customers.

16:01.580 --> 16:03.600 You could add an MCP server.

16:03.800 --> 16:06.160 There are some public MCP servers.

16:06.320 --> 16:09.780 There are some open source MCP server implementations.

16:10.080 --> 16:12.300 And we're also looking into that.

16:12.560 --> 16:15.560 And I want to mention here that in my last

16:15.640 --> 16:24.200 talk, during the vector search conference, I actually presented a MyVector with MCP server demo.

16:24.200 --> 16:27.880 And that recording should be available.

16:27.880 --> 16:30.900 And actually, it works.

16:30.900 --> 16:44.520 This is pure MySQL open source, pure MyVector open source, and a pure open source MCP server, with, you know, clinical trials data, one of the public data sets that I've used.

16:44.520 --> 16:52.300 You can actually ask questions and it'll answer, and then you can continue the chat using this very technology.
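
[Editor's sketch] One way to picture how search results become chat context is the small helper below. It is a hypothetical illustration only: `build_prompt`, the row layout, and the placeholder rows are all assumptions, and it does not use any MCP SDK or reproduce the demo mentioned in the talk.

```python
def build_prompt(question, retrieved_rows, max_context_chars=400):
    """Turn similarity-search hits into context for a chat model.

    retrieved_rows is assumed to be ordered best match first.
    """
    context_lines = []
    used = 0
    for row in retrieved_rows:
        line = f"[{row['id']}] {row['text']}"
        if used + len(line) > max_context_chars:
            break  # keep the prompt within a fixed context budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up placeholder rows, standing in for the result of a vector search.
hits = [
    {"id": 101, "text": "Placeholder summary of trial A."},
    {"id": 102, "text": "Placeholder summary of trial B."},
]
prompt = build_prompt("Which trials are relevant?", hits)
print(prompt)
```

In a setup like the one described, a server sitting between the chat model and the database would run the vector search and hand back rows like these, so the model answers from domain data rather than from its general training.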

16:52.300 --> 16:54.060 That's awesome.

16:54.060 --> 17:00.060 I've actually been playing around with writing my own MCP servers and having them interact with MySQL.

17:00.060 --> 17:14.880 And I think MCP is going to wind up being pretty big, because it does give that domain-specific context that the LLMs can actually use to generate content or answers or whatever.

17:14.880 --> 17:18.120 So I'm interested to play around with that.

17:19.000 --> 17:19.520 Absolutely.

17:19.740 --> 17:20.120 Absolutely.

17:20.120 --> 17:23.100 This is a very interesting development, very recent.

17:23.740 --> 17:28.440 A lot of people are experimenting right now with MCP servers.

17:29.160 --> 17:44.300 MCP servers are going to be, I think, the salt and pepper, or the sauce, of this technology: you know, having the vectors and embeddings, the ANN search, the HNSW index, and the MCP server.

17:44.300 --> 17:47.220 So it's going to complete the puzzle, in my opinion.

17:47.600 --> 17:50.920 And also, like I said, I've already given a demo.

17:51.060 --> 17:53.840 We are also looking into this technology.

17:54.120 --> 17:59.220 And of course, that is also still under development.

17:59.740 --> 18:08.980 If you're opening up to a public MCP, then your security and compliance are now outside again.

18:08.980 --> 18:16.200 You may want to do everything internally, in your own database, with your own security and compliance.

18:16.700 --> 18:20.660 So that's a game changer, to have something available for yourself.

18:21.260 --> 18:21.660 Excellent.

18:21.840 --> 18:23.140 So thank you very much, Alkin.

18:23.660 --> 18:25.060 Thank you for your time.

18:25.460 --> 18:27.740 We know you are on your boat right now, sailing.

18:27.960 --> 18:28.900 So that's awesome.

18:29.180 --> 18:29.580 Thank you.

18:29.640 --> 18:30.420 Thanks a lot, guys.

18:30.600 --> 18:31.120 Thank you, Alkin.

18:31.180 --> 18:33.120 And thank you for all your contributions to the community.

18:33.120 --> 18:36.880 That's a wrap on this episode of Inside MySQL: Sakila Speaks.

18:37.040 --> 18:38.280 Thanks for hanging out with us.

18:38.560 --> 18:42.320 If you enjoyed listening, please click subscribe to get all the latest episodes.

18:42.600 --> 18:45.540 We would also love your reviews and ratings on your podcast app.

18:45.860 --> 18:50.080 Be sure to join us for the next episode of Inside MySQL: Sakila Speaks.
