Semantic search with the Weaviate Vector database

The previous blog discussed the influence of the document format and the way it is embedded in combination with semantic search, using LangChain4j. One of the main conclusions was that the way a document is embedded has a major influence on the results. However, a perfect result was not achieved.

In this post, you’ll take a look at Weaviate, a vector database that has a Java client library available. You will investigate whether better results can be achieved.

The source documents are two Wikipedia documents. You will use the discography and list of songs recorded by Bruce Springsteen. The interesting thing about these documents is that they contain facts and that they are mostly in tabular form. Parts of these documents are converted to Markdown for better display. The same documents were used in the previous blog, so it will be interesting to see how the findings from that post compare to the approach used in this post.

The sources used in this blog can be found on GitHub.

Prerequisites

The prerequisites for this blog are:

  • Basic knowledge of embedding and vector storage
  • Basic knowledge of Java (Java 21 is used)
  • Basic knowledge of Docker

The Weaviate Getting Started Guides are also interesting reading material.

How to implement vector similarity search

1. Installing Weaviate

There are several ways to install Weaviate. An easy way is to use Docker Compose. Just use the sample Docker Compose file.

version: '3.4'
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: semitechnologies/weaviate:1.23.2
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: 'text2vec-cohere,text2vec-huggingface,text2vec-palm,text2vec-openai,generative-openai,generative-cohere,generative-palm,ref2vec-centroid,reranker-cohere,qna-openai'
      CLUSTER_HOSTNAME: 'node1'
volumes:
  weaviate_data:

Run the Compose file from the root of the repository.

$ docker compose -f docker/compose-initial.yaml up

You can turn it off using CTRL+C or the following command:

$ docker compose -f docker/compose-initial.yaml down

2. Connect to Weaviate

First, let’s try to connect to Weaviate via the Java library. Add the following dependency to the pom file:

<dependency>
  <groupId>io.weaviate</groupId>
  <artifactId>client</artifactId>
  <version>4.5.1</version>
</dependency>

The following code will create a connection to Weaviate and display some metadata about the instance.

Config config = new Config("http", "localhost:8080");
WeaviateClient client = new WeaviateClient(config);
Result<Meta> meta = client.misc().metaGetter().run();
if (meta.getError() == null) {
    System.out.printf("meta.hostname: %s\n", meta.getResult().getHostname());
    System.out.printf("meta.version: %s\n", meta.getResult().getVersion());
    System.out.printf("meta.modules: %s\n", meta.getResult().getModules());
} else {
    System.out.printf("Error: %s\n", meta.getError().getMessages());
}

The output is as follows:

meta.hostname: http://[::]:8080
meta.version: 1.23.2
meta.modules: generative-cohere=documentationHref=https://docs.cohere.com/reference/generate, name=Generative Search - Cohere, generative-openai=documentationHref=https://platform.openai.com/docs/api-reference/completions, name=Generative Search - OpenAI, generative-palm=documentationHref=https://cloud.google.com/vertex-ai/docs/generative-ai/chat/test-chat-prompts, name=Generative Search - Google PaLM, qna-openai=documentationHref=https://platform.openai.com/docs/api-reference/completions, name=OpenAI Question & Answering Module, ref2vec-centroid=, reranker-cohere=documentationHref=https://txt.cohere.com/rerank/, name=Reranker - Cohere, text2vec-cohere=documentationHref=https://docs.cohere.ai/embedding-wiki/, name=Cohere Module, text2vec-huggingface=documentationHref=https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task, name=Hugging Face Module, text2vec-openai=documentationHref=https://platform.openai.com/docs/guides/embeddings/what-are-embeddings, name=OpenAI Module, text2vec-palm=documentationHref=https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings, name=Google PaLM Module

The version and the activated modules are shown. The modules correspond to those enabled in the Docker Compose file.

3. Embed the documents

To be able to search the documents, they first need to be embedded. This can be done with the text2vec-transformers module. Create a new Docker Compose file with only the text2vec-transformers module enabled. Also set this module as DEFAULT_VECTORIZER_MODULE, point TRANSFORMERS_INFERENCE_API to the transformers container, and use the sentence-transformers-all-MiniLM-L6-v2-onnx image for that container. Use the ONNX image when you are not using a GPU.

version: '3.4'
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: semitechnologies/weaviate:1.23.2
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2-onnx
volumes:
  weaviate_data:

Start the containers:

$ docker compose -f docker/compose-embed.yaml up

Embedding the data is an important step that needs to be done carefully. It is therefore important to understand the Weaviate concepts.

  • Every data object belongs to a Class, and a Class has one or more Properties.
  • A Class can be seen as a collection, and every data object (represented as a JSON document) can be represented by a vector (i.e. an embedding).
  • Every Class contains the objects that belong to it, which correspond to a common schema.

Three Markdown files with Bruce Springsteen data are available. The embedding is done as follows:

  • Every Markdown file is converted to a Weaviate Class.
  • A Markdown file starts with a header. The header contains the column names, which are converted to Weaviate Properties. Properties need to be valid GraphQL names. Therefore, the column names have been changed slightly compared to the previous blog. For example, writer(s) has become writers, album details has become AlbumDetails, etc.
  • The data follows the header. Each row in the table is converted to a data object belonging to the Class.
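To get a feeling for what a valid GraphQL name looks like, the following minimal sketch turns a column title into a name that contains only letters, digits, and underscores. Note that PropertyNameSketch and toPropertyName are hypothetical names written for this illustration. They are not part of the Weaviate client or of the repository, where the column names were adjusted by hand in the Markdown files.

```java
class PropertyNameSketch {

    // Hypothetical helper, not part of the Weaviate client: keeps ASCII letters,
    // digits and underscores, drops everything else, and capitalizes the first
    // letter of every word after the first one.
    static String toPropertyName(String columnTitle) {
        StringBuilder name = new StringBuilder();
        boolean capitalizeNext = false;
        for (char c : columnTitle.strip().toCharArray()) {
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean asciiDigit = c >= '0' && c <= '9';
            if (c == ' ') {
                capitalizeNext = true;
            } else if (asciiLetter || asciiDigit || c == '_') {
                name.append(capitalizeNext ? Character.toUpperCase(c) : c);
                capitalizeNext = false;
            }
            // any other character, such as ( or ), is dropped
        }
        // a GraphQL name may not start with a digit, so prefix an underscore
        if (name.length() > 0 && Character.isDigit(name.charAt(0))) {
            name.insert(0, '_');
        }
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPropertyName("writer(s)"));     // writers
        System.out.println(toPropertyName("album details")); // albumDetails
    }
}
```

The sketch produces albumDetails where the Markdown file uses AlbumDetails. Both are valid names, the difference is only the casing convention chosen in this illustration.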

An example of a Markdown file is the compilation albums file.

| Title                            | US | AUS | CAN | GER | IRE | NLD | NZ | NOR | SWE | UK |
|----------------------------------|----|-----|-----|-----|-----|-----|----|-----|-----|----|
| Greatest Hits                    | 1  | 1   | 1   | 1   | 1   | 2   | 1  | 1   | 1   | 1  |
| Tracks                           | 27 | 97  | —   | 63  | —   | 36  | —  | 4   | 11  | 50 |
| 18 Tracks                        | 64 | 98  | 58  | 8   | 20  | 69  | —  | 2   | 1   | 23 |
| The Essential Bruce Springsteen  | 14 | 41  | —   | —   | 5   | 22  | —  | 4   | 2   | 15 |
| Greatest Hits                    | 43 | 17  | 21  | 25  | 2   | 4   | 3  | 3   | 1   | 3  |
| The Promise                      | 16 | 22  | 27  | 1   | 4   | 4   | 30 | 1   | 1   | 7  |
| Collection: 1973–2012            | —  | 6   | —   | 23  | 2   | 78  | 19 | 1   | 6   | —  |
| Chapter and Verse                | 5  | 2   | 21  | 4   | 2   | 5   | 4  | 3   | 2   | 2  |

The following sections explain in more detail the steps taken to embed documents. The complete source code is available on GitHub. This isn’t the cleanest code, but I hope it’s understandable.

3.1 Basic setup

A map is created that links each file name to the Weaviate Class name to be used.

private static Map<String, String> documentNames = Map.of(
            "bruce_springsteen_list_of_songs_recorded.md", "Songs",
            "bruce_springsteen_discography_compilation_albums.md", "CompilationAlbums",
            "bruce_springsteen_discography_studio_albums.md", "StudioAlbums");

In the basic setup, a connection to Weaviate is created, all data is removed from the database, and the files are read. Every file is then processed one by one.

Config config = new Config("http", "localhost:8080");
WeaviateClient client = new WeaviateClient(config);

// Remove existing data
Result<Boolean> deleteResult = client.schema().allDeleter().run();
if (deleteResult.hasErrors()) {
    // Print the error rather than the (empty) result when the deletion fails
    System.out.println(new GsonBuilder().setPrettyPrinting().create().toJson(deleteResult.getError()));
}

List<Document> documents = loadDocuments(toPath("markdown-files"));

for (Document document : documents) {
    ...
}

3.2 Convert the header to a class

The header information needs to be converted to a Weaviate Class.

  • Split the complete file row by row.
  • The first line contains the header. Split it on the | separator and store the result in the variable tempSplittedHeader.
  • The header starts with a |, so the first entry in tempSplittedHeader is empty. Remove it and store the rest in the variable splittedHeader.
  • For every item in splittedHeader (i.e. every column name), a Weaviate Property is created. Leading and trailing spaces are stripped from the name.
  • Create the Weaviate documentClass with the class name as defined in the documentNames map and the Properties just created.
  • Add the class to the schema and verify the result.

// Split the document line by line
String[] splittedDocument = document.text().split("\n");

// split the header on | and remove the first item (the line starts with | and the first item is therefore empty)
String[] tempSplittedHeader = splittedDocument[0].split("\\|");
String[] splittedHeader = Arrays.copyOfRange(tempSplittedHeader, 1, tempSplittedHeader.length);

// Create the Weaviate collection, every item in the header is a Property
ArrayList<Property> properties = new ArrayList<>();
for (String splittedHeaderItem : splittedHeader) {
    Property property = Property.builder().name(splittedHeaderItem.strip()).build();
    properties.add(property);
}

WeaviateClass documentClass = WeaviateClass.builder()
        .className(documentNames.get(document.metadata("file_name")))
        .properties(properties)
        .build();

// Add the class to the schema
Result<Boolean> collectionResult = client.schema().classCreator()
        .withClass(documentClass)
        .run();
if (collectionResult.hasErrors()) {
    System.out.println("Creation of collection failed: " + documentNames.get(document.metadata("file_name")));
}
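If you want to see what the split on | does to a header line, the following standalone sketch can help. It uses only the JDK, so no Weaviate instance is needed. HeaderSplitSketch and columnNames are names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class HeaderSplitSketch {

    // Splits a Markdown table line on | and returns the stripped cell values.
    static List<String> columnNames(String headerLine) {
        String[] parts = headerLine.split("\\|");
        if (parts.length < 2) {
            return List.of(); // nothing after the leading |
        }
        // parts[0] is the empty string in front of the leading |, so skip it
        return Arrays.stream(parts, 1, parts.length)
                .map(String::strip)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String header = "| Title                            | US | AUS | CAN |";
        System.out.println(columnNames(header)); // [Title, US, AUS, CAN]
    }
}
```

String.split drops trailing empty strings, which is why only the empty entry in front of the leading | needs to be removed and not one after the closing |.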

3.3 Convert rows of data into objects

Every data row needs to be converted to a Weaviate data object.

  • Copy the rows containing data into the variable dataOnly.
  • Loop over every row; a row is represented by the variable documentLine.
  • Split each line on the | separator and store the result in the variable tempSplittedDocumentLine.
  • Just like the header, every row starts with a |, so the first entry in tempSplittedDocumentLine is empty. Remove it and store the rest in the variable splittedDocumentLine.
  • Every item in the row becomes a property. The complete row is converted to properties in the variable propertiesDocumentLine. Leading and trailing spaces are stripped from the data.
  • Add the data object to the Class and verify the result.
  • Finally, print the result.

// Preserve only the rows containing data, the first two rows contain the header
String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length);

for (String documentLine : dataOnly) {
    // split a data row on | and continue with the steps listed above
    ...
}
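The conversion of a single row into properties can be sketched without Weaviate. The example below is a minimal, JDK-only illustration of the steps listed above and is not the code from the repository. RowToPropertiesSketch and toProperties are hypothetical names, and the step that stores the object in Weaviate is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowToPropertiesSketch {

    // Pairs each column name with the matching cell of one table row.
    static Map<String, Object> toProperties(List<String> columnNames, String documentLine) {
        String[] cells = documentLine.split("\\|");
        Map<String, Object> properties = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            // cells[0] is the empty string in front of the leading |, hence i + 1
            String value = i + 1 < cells.length ? cells[i + 1].strip() : "";
            properties.put(columnNames.get(i), value);
        }
        return properties;
    }

    public static void main(String[] args) {
        List<String> columnNames = List.of("Title", "US", "AUS", "CAN");
        String row = "| Greatest Hits                    | 1  | 1   | 1   |";
        // In the real code, a map like this is handed to the Weaviate client
        // to create the data object; that call is not part of this sketch.
        System.out.println(toProperties(columnNames, row));
        // {Title=Greatest Hits, US=1, AUS=1, CAN=1}
    }
}
```

A LinkedHashMap is used here only to keep the columns in table order when printing. For storing the object, the order of the properties should not matter.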

3.4 Result

Running the code to embed the documents prints what is stored in the Weaviate vector database. As you can see below, a data object has a UUID, the class is StudioAlbums, the Properties are listed, and the corresponding vector is shown.

{
  "id": "e0d5e1a3-61ad-401d-a264-f95a9a901d82",
  "class": "StudioAlbums",
  "creationTimeUnix": 1705842658470,
  "lastUpdateTimeUnix": 1705842658470,
  "properties": {
    "aUS": "3",
    "cAN": "8",
    "gER": "1",
    "iRE": "2",
    "nLD": "1",
    "nOR": "1",
    "nZ": "4",
    "sWE": "1",
    "title": "Only the Strong Survive",
    "uK": "2",
    "uS": "8"
  },
  "vector": [
    -0.033715352,
    -0.07489116,
    -0.015459526,
    -0.025204511,
    ...
    0.03576842,
    -0.010400549,
    -0.075309984,
    -0.046005197,
    0.09666792,
    0.0051724687,
    -0.015554721,
    0.041699238,
    -0.09749843,
    0.052182134,
    -0.0023900834
  ]
}

4. Management of collections

So now you have data in a vector database. What information can be retrieved from the database? For example, you can manage a collection.

4.1 Retrieving a collection definition

The collection definition can be found as follows:

String className = "CompilationAlbums";
 
Result<WeaviateClass> result = client.schema().classGetter()
        .withClassName(className)
        .run();
 
String json = new GsonBuilder().setPrettyPrinting().create().toJson(result.getResult());
System.out.println(json);

The output is as follows:

{
  "class": "CompilationAlbums",
  "description": "This property was generated by Weaviate\u0027s auto-schema feature on Sun Jan 21 13:10:58 2024",
  "invertedIndexConfig": {
    "bm25": {
      "k1": 1.2,
      "b": 0.75
    },
    "stopwords": {
      "preset": "en"
    },
    "cleanupIntervalSeconds": 60
  },
  "moduleConfig": {
    "text2vec-transformers": {
      "poolingStrategy": "masked_mean",
      "vectorizeClassName": true
    }
  },
  "properties": [
    {
      "name": "uS",
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate\u0027s auto-schema feature on Sun Jan 21 13:10:58 2024",
      "tokenization": "word",
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-transformers": {
          "skip": false,
          "vectorizePropertyName": false
        }
      }
    },
    ...

You can see how the class is vectorized, which properties it has, and so on.

4.2 Retrieving collection objects

Can you also retrieve the objects of a collection? Yes, you can, but at the time of writing not with the Java client library directly. When browsing the Weaviate documentation, you will notice that there is no example code for the Java client library. However, you can use the GraphQL API, which can also be called from Java code. The code for retrieving the title property of every data object in the CompilationAlbums Class is as follows:

  • Call the graphQL method of the Weaviate client.
  • Define the Weaviate Class and the fields you want to retrieve.
  • Print the result.

Field song = Field.builder().name("title").build();

Result<GraphQLResponse> result = client.graphQL().get()
        .withClassName("CompilationAlbums")
        .withFields(song)
        .run();
if (result.hasErrors()) {
    System.out.println(result.getError());
    return;
}
System.out.println(result.getResult());
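To make it more tangible what such a call boils down to, the following sketch builds a comparable GraphQL Get query as plain text and wraps it in a JSON request body. It is only an illustration under assumptions: the class GraphQlQuerySketch and its methods are hypothetical, and the post method assumes that the Weaviate instance from the Compose file listens on localhost:8080 and serves GraphQL at /v1/graphql. The client library may build its request differently.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class GraphQlQuerySketch {

    // Builds a Get query for one class and a list of plain fields.
    static String buildGetQuery(String className, String... fields) {
        return "{ Get { " + className + " { " + String.join(" ", fields) + " } } }";
    }

    // Wraps the query in a JSON body, escaping backslashes and double quotes.
    static String toJsonBody(String query) {
        String escaped = query.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\": \"" + escaped + "\"}";
    }

    // Assumption: Weaviate runs locally and exposes GraphQL at /v1/graphql.
    static String post(String jsonBody) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/v1/graphql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the containers running, post(toJsonBody(buildGetQuery("CompilationAlbums", "title"))) should return the titles as JSON. The client library remains the more convenient option, because it takes care of escaping and of mapping the response.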

The result shows all titles:

GraphQLResponse(
  data=
    Get=
      CompilationAlbums=[
        title=Chapter and Verse, 
        title=The Promise, 
        title=Greatest Hits, 
        title=Tracks, 
        title=18 Tracks, 
        title=The Essential Bruce Springsteen, 
        title=Collection: 1973–2012, 
        title=Greatest Hits
      ]
    
  , 
errors=null)

5. Semantic search

The whole purpose of embedding the documents is to be able to search them. In order to search, you also need to use the GraphQL API. Different search operators are available. Just like in the previous blog, five questions are asked about the data.

  1. On which album was “adam raised a cain” originally released?
    The answer is “Darkness on the Edge of Town”.
  2. What is the highest chart position of “Greetings from Asbury Park, N.J.” in the US?
    The answer is #60.
  3. What is the highest chart position of the album “tracks” in canada?
    The album did not chart in Canada.
  4. In which year was “Highway Patrolman” released?
    The answer is 1982.
  5. Who produced “all or nothin’ at all”?
    The answer is Jon Landau, Chuck Plotkin, Bruce Springsteen and Roy Bittan.

In the source code, you provide the class name and the corresponding fields. This information is kept in a static class for each collection. The code does the following:

  • Create a connection to Weaviate.
  • Add the fields of the class and also add two additional fields, the certainty and the distance.
  • Embed the question using a NearTextArgument.
  • Search the collection via the GraphQL API and limit the result to 1.
  • Print the result.

private static void askQuestion(String className, Field[] fields, String question) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);

    Field additional = Field.builder()
            .name("_additional")
            .fields(Field.builder().name("certainty").build(), // only supported if distance==cosine
                    Field.builder().name("distance").build()   // always supported
            ).build();
    Field[] allFields = Arrays.copyOf(fields, fields.length + 1);
    allFields[fields.length] = additional;

    // Embed the question
    NearTextArgument nearText = NearTextArgument.builder()
            .concepts(new String[]{question})
            .build();

    Result<GraphQLResponse> result = client.graphQL().get()
            .withClassName(className)
            .withFields(allFields)
            .withNearText(nearText)
            .withLimit(1)
            .run();

    if (result.hasErrors()) {
        System.out.println(result.getError());
        return;
    }
    System.out.println(result.getResult());
}

Call this method for the five questions.

askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?");
askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?");
askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "what is the highest chart position of the album \"tracks\" in canada?");
askQuestion(Song.NAME, Song.getFields(), "in which year was \"Highway Patrolman\" released?");
askQuestion(Song.NAME, Song.getFields(), "who produced \"all or nothin' at all?\"");

The result is amazing: for all five questions, the correct data object is returned.

GraphQLResponse(
  data=
    Get=
      Songs=[
        _additional=certainty=0.7534831166267395, distance=0.49303377, 
         originalRelease=Darkness on the Edge of Town, 
         producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant), 
         song="Adam Raised a Cain", writers=Bruce Springsteen, year=1978
      ]
     
  , 
  errors=null)
GraphQLResponse(
  data=
    Get=
      StudioAlbums=[
        _additional=certainty=0.803815484046936, distance=0.39236903, 
         aUS=71, 
         cAN=—, 
         gER=—, 
         iRE=—, 
         nLD=—, 
         nOR=—, 
         nZ=—, 
         sWE=35, 
         title=Greetings from Asbury Park,N.J., uK=41, uS=60
      ]
    
  , 
  errors=null)
GraphQLResponse(
  data=
    Get=
      CompilationAlbums=[
        _additional=certainty=0.7434340119361877, distance=0.513132, 
         aUS=97, 
         cAN=—, 
         gER=63, 
         iRE=—, 
         nLD=36, 
         nOR=4, 
         nZ=—, 
         sWE=11, 
         title=Tracks, 
         uK=50, 
         uS=27
      ]
    
  , 
  errors=null)
GraphQLResponse(
  data=
    Get=
      Songs=[
        _additional=certainty=0.743279218673706, distance=0.51344156, 
         originalRelease=Nebraska, 
         producers=Bruce Springsteen, 
         song="Highway Patrolman", 
         writers=Bruce Springsteen, 
         year=1982
      ]
    
  , 
  errors=null)
GraphQLResponse(
  data=
    Get=
      Songs=[
        _additional=certainty=0.7136414051055908, distance=0.5727172, 
         originalRelease=Human Touch, 
         producers=Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan, 
         song="All or Nothin' at All", 
         writers=Bruce Springsteen, 
         year=1992
      ]
    
  , 
  errors=null)
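The certainty and distance values in the output are related. With the cosine distance, the distance runs from 0 to 2, and the values shown above are consistent with certainty = 1 - distance / 2. The following minimal sketch shows this relation with plain Java. The class and method names are made up for this illustration, and the cosine distance is computed here as 1 minus the cosine similarity.

```java
class CertaintySketch {

    // Cosine distance between two vectors of equal length: 1 - cosine similarity.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Maps a cosine distance (0..2) to a certainty (1..0).
    static double certaintyFromCosineDistance(double distance) {
        return 1.0 - distance / 2.0;
    }

    public static void main(String[] args) {
        // Distance reported for "Adam Raised a Cain" in the output above;
        // the result is approximately 0.7535, matching the reported certainty.
        System.out.println(certaintyFromCosineDistance(0.49303377));

        // Two perpendicular vectors have distance 1, so the certainty is 0.5.
        double distance = cosineDistance(new double[]{1, 0}, new double[]{0, 1});
        System.out.println(certaintyFromCosineDistance(distance));
    }
}
```

This also explains the comments in the code above: the certainty is only available when the cosine distance is used, whereas the distance is always returned.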

6. Explore the collections

The semantic search implementation assumed that you knew in which collection to search for the answer. Most of the time, you do not know which collection to search. The explore function can help to search across multiple collections. There are some limitations to the use of the explore function:

  • Only one vectorizer module may be enabled.
  • The vector search must be nearText or nearVector.

The askQuestion method changes to the following. Just like in the previous section, you want to return some additional, more generic fields of the collection. The question is embedded in a NearTextArgument and the collections are explored.

private static void askQuestion(String question) {
    Config config = new Config("http", "localhost:8080");
    WeaviateClient client = new WeaviateClient(config);

    ExploreFields[] fields = new ExploreFields[]{
            ExploreFields.CERTAINTY,  // only supported if distance==cosine
            ExploreFields.DISTANCE,   // always supported
            ExploreFields.BEACON,
            ExploreFields.CLASS_NAME
    };

    NearTextArgument nearText = NearTextArgument.builder().concepts(new String[]{question}).build();

    Result<GraphQLResponse> result = client.graphQL().explore()
            .withFields(fields)
            .withNearText(nearText)
            .run();

    if (result.hasErrors()) {
        System.out.println(result.getError());
        return;
    }
    System.out.println(result.getResult());
}

Running this code returns an error. An issue was reported for this, because the error that is returned is obscure.

GraphQLResponse(data=Explore=null, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data=Explore=null, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data=Explore=null, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data=Explore=null, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])
GraphQLResponse(data=Explore=null, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])])

To work around this error, it is interesting to verify whether the correct answer has the highest certainty across all collections. Therefore, every collection is queried for each question. The complete code is available on GitHub; only the code for question 1 is shown below. The askQuestion implementation is the one used in the Semantic search section.

private static void question1() {
    askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?");
    askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "on which album was \"adam raised a cain\" originally released?");
    askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "on which album was \"adam raised a cain\" originally released?");
}

Running this code returns the following output.

GraphQLResponse(data=Get=Songs=[_additional=certainty=0.7534831166267395, distance=0.49303377, originalRelease=Darkness on the Edge of Town, producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant), song="Adam Raised a Cain", writers=Bruce Springsteen, year=1978], errors=null)
GraphQLResponse(data=Get=StudioAlbums=[_additional=certainty=0.657206118106842, distance=0.68558776, aUS=9, cAN=7, gER=—, iRE=73, nLD=4, nOR=12, nZ=11, sWE=9, title=Darkness on the Edge of Town, uK=14, uS=5], errors=null)
GraphQLResponse(data=Get=CompilationAlbums=[_additional=certainty=0.6488107144832611, distance=0.7023786, aUS=97, cAN=—, gER=63, iRE=—, nLD=36, nOR=4, nZ=—, sWE=11, title=Tracks, uK=50, uS=27], errors=null)

The interesting parts here are the certainties:

  • Collection Songs has a certainty of 0.75
  • Collection StudioAlbums has a certainty of 0.66
  • Collection CompilationAlbums has a certainty of 0.65

The correct answer can be found in the Songs collection, which has the highest certainty. So, this is great. When you verify this for the other questions, you will see that the collection containing the correct answer always has the highest certainty.
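Selecting the collection with the highest certainty is easy to automate. The sketch below assumes that the certainty of the best match per collection has already been collected in a map. Here, the values are copied by hand from the output above, and BestCollectionSketch and bestCollection are hypothetical names that are not part of the repository.

```java
import java.util.Map;

class BestCollectionSketch {

    // Returns the name of the collection whose best match has the highest certainty.
    static String bestCollection(Map<String, Double> certaintyPerCollection) {
        return certaintyPerCollection.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Certainties for question 1, taken from the output above
        Map<String, Double> certainties = Map.of(
                "Songs", 0.7534831166267395,
                "StudioAlbums", 0.657206118106842,
                "CompilationAlbums", 0.6488107144832611);
        System.out.println(bestCollection(certainties)); // Songs
    }
}
```

In a real application, you would have to extract the certainty from each GraphQLResponse first. That part is left out here.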

Conclusion

In this post, you transformed the source documents to fit in a vector database. The semantic search results are amazing. In the previous posts, it was quite a struggle to retrieve the correct answers to the questions. By restructuring the data and using only vector semantic search, all five questions were answered correctly.
