Tutorial: Vector Search and Similarity in Postgres

Software engineers occupy an exciting place in this world. Regardless of technology or industry, our task is to solve problems that directly contribute to the goals of our employers. As a bonus, we get the ability to use technology to mitigate any challenges that come our way.

For this example, I wanted to focus on how pgvector — an open source vector similarity search for Postgres — can be used to identify data similarities that exist in enterprise data.

A simple use case

As a simple example, let’s say the marketing department needs help with a campaign they plan to launch. The goal is to reach all Salesforce accounts that are in industries closely related to the software industry.

Ultimately, they would like to focus on accounts in the top three most similar industries, with the possibility of using this tool in the future to find similarities for other industries. If possible, they would like the option to provide the desired number of matching industries, rather than always returning the top three.

High level design

This use case focuses on performing a similarity search. Although it is possible to complete this exercise manually, the Wikipedia2Vec tool comes to mind because of the pre-trained inserts already created for multiple languages. Inserting words — also known as vectors — are numerical representations of words that contain their syntactic and semantic information. By representing words as vectors, we can mathematically determine which words are semantically “closer” to others.

In our example, we could also write a simple Python program to create word vectors for each industry configured in Salesforce.

The pgvector the extension requires a Postgres database. However, the business data for our example currently resides in Salesforce. Fortunately, Heroku Connect provides an easy way to sync Salesforce accounts with Heroku Postgres, storing them in a table called salesforce.account. Then we’ll call another table salesforce.industries containing each industry in Salesforce (as a VARCHAR key), along with its associated vector of words.

With Salesforce data and word vectors in Postgres, we will create a RESTful API using Java and Spring Boot. This service will perform the necessary queries, returning the results in JSON format.

We can illustrate a high-level view of the solution like this:

The source code will be in GitLab. Issue a git push heroku command will trigger the deployment in Heroku, introducing a RESTful API that the marketing team can easily use.

Creation of a solution

With the high-level design in place, we can start building the solution. Using my Salesforce login, I was able to get to Accounts screen to display the data for this exercise. Here’s an example of a company’s first page:

Create a Heroku application

For this effort, I planned to use Heroku to handle the marketing team’s requests. I logged into my Heroku account and used Create a new application the button for establishing a new application, the so-called similarity-search-sfdc:

After creating the application, I went to the Resources tab to find the Heroku Postgres plugin. I typed “Postgres” into the plugins search field.

After selecting Heroku Postgres I chose from the list Default 0 plan, but pgvector is available in standard (or higher) tier database offerings running PostgreSQL 15 or a base tier beta database.

When I verified the plugin, Heroku generated and served it DATABASE_URL string of connections. I found this in Configuration variants section of settings my app tab. I used this information to connect to my database and enabled the pgvector extension like this:

Then I searched and found Heroku Connect addition. I knew this would give me an easy way to connect to the company’s data in Salesforce.

For this exercise, free Demo edition the plan works just fine.

In this moment, Resources card for similarity-search-sfdc the application looked like this:

I followed the “Setting Up Heroku Connect” instructions to connect my Salesforce account to Heroku Connect. Then I chose Account synchronization object. Once complete, I could see the same Salesforce account data in Heroku Connect and in the underlying Postgres database.

From a SQL perspective, what I did resulted in the creation of a salesforce.account table with the following design:

create table salesforce.account
(
    createddate    timestamp,
    isdeleted      boolean,
    name           varchar(255),
    systemmodstamp timestamp,
    accountnumber  varchar(40),
    industry       varchar(255),
    sfid           varchar(18),
    id             serial
        primary key,
    _hc_lastop     varchar(32),
    _hc_err        text
);

Generate vectors

In order for the similarity search to work as expected, I had to generate word vectors for each Salesforce account industry:

  • Apparel
  • Banking
  • Biotechnology
  • Construction
  • Education
  • Electronics
  • Engineering
  • Fun
  • Food and drink
  • Finances
  • Government
  • health care
  • Hospitality
  • Insurance
  • Media
  • Not for profit
  • the rest
  • Recreation
  • Retail
  • delivery
  • Technology
  • Telecommunications
  • Transport
  • Utilities

Since the primary use case indicated the need to find similarities for the software industry, we should generate a word vector for that industry as well.

To keep things simple for this exercise, I manually performed this task using Python 3.9 and a file named embed.pywhich looks like this:

from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load('enwiki_20180420_100d.pkl')
print(wiki2vec.get_word_vector('software').tolist())

Keep in mind – get_word_vector() the method expects industry representation in lowercase letters.

Starting python embed.py generated the following word vector for software word:

[-0.40402618050575256, 0.5711150765419006, -0.7885153293609619, -0.15960034728050232, -0.5692323446273804, 
0.005377458408474922, -0.1315757781267166, -0.16840921342372894, 0.6626015305519104, -0.26056772470474243, 
0.3681095242500305, -0.453583300113678, 0.004738557618111372, -0.4111144244670868, -0.1817493587732315, 
-0.9268549680709839, 0.07973367720842361, -0.17835664749145508, -0.2949991524219513, -0.5533796548843384, 
0.04348105192184448, -0.028855713084340096, -0.13867013156414032, -0.6649054884910583, 0.03129105269908905, 
-0.24817068874835968, 0.05968991294503212, -0.24743635952472687, 0.20582349598407745, 0.6240783929824829, 
0.3214546740055084, -0.14210252463817596, 0.3178422152996063, 0.7693028450012207, 0.2426985204219818, 
-0.6515568494796753, -0.2868216037750244, 0.3189859390258789, 0.5168254971504211, 0.11008890718221664, 
0.3537853956222534, -0.713259220123291, -0.4132286608219147, -0.026366405189037323, 0.003034653142094612, 
-0.5275223851203918, -0.018167126923799515, 0.23878540098667145, -0.6077089905738831, 0.5368344187736511, 
-0.1210874393582344, 0.26415619254112244, -0.3066694438457489, 0.1471938043832779, 0.04954215884208679, 
0.2045321762561798, 0.1391817331314087, 0.5286830067634583, 0.5764685273170471, 0.1882934868335724, 
-0.30167853832244873, -0.2122340053319931, -0.45651525259017944, -0.016777794808149338, 0.45624101161956787, 
-0.0438646525144577, -0.992512047290802, -0.3771328926086426, 0.04916151612997055, -0.5830298066139221, 
-0.01255014631897211, 0.21600870788097382, -0.18419665098190308, 0.1754663586616516, -0.1499166339635849, 
-0.1916201263666153, -0.22884036600589752, 0.17280352115631104, 0.25274306535720825, 0.3511175513267517, 
-0.20270302891731262, -0.6383468508720398, 0.43260180950164795, -0.21136239171028137, -0.05920517444610596, 
0.7145522832870483, 0.7626600861549377, -0.5473887920379639, 0.4523043632507324, -0.1723199188709259, 
-0.10209759324789047, -0.5577948093414307, -0.10156919807195663, 0.31126976013183594, 0.3604489266872406, 
-0.13295558094978333, 0.2473849356174469, 0.278846800327301, -0.28618067502975464, 0.00527254119515419]

Create a table for industries

In order to store word vectors, we had to add industries table into the Postgres database using the following SQL command:

create table salesforce.industries
(
    name varchar not null constraint industries_pk primary key,
    embeddings vector(100) not null
);

with industries table created, we will insert each of the generated word vectors. We do this using SQL statements similar to the following:

INSERT INTO salesforce.industries
(name, embeddings)
VALUES
    ('Software','[-0.40402618050575256, 0.5711150765419006, -0.7885153293609619, -0.15960034728050232, -0.5692323446273804, 0.005377458408474922, -0.1315757781267166, -0.16840921342372894, 0.6626015305519104, -0.26056772470474243, 0.3681095242500305, -0.453583300113678, 0.004738557618111372, -0.4111144244670868, -0.1817493587732315, -0.9268549680709839, 0.07973367720842361, -0.17835664749145508, -0.2949991524219513, -0.5533796548843384, 0.04348105192184448, -0.028855713084340096, -0.13867013156414032, -0.6649054884910583, 0.03129105269908905, -0.24817068874835968, 0.05968991294503212, -0.24743635952472687, 0.20582349598407745, 0.6240783929824829, 0.3214546740055084, -0.14210252463817596, 0.3178422152996063, 0.7693028450012207, 0.2426985204219818, -0.6515568494796753, -0.2868216037750244, 0.3189859390258789, 0.5168254971504211, 0.11008890718221664, 0.3537853956222534, -0.713259220123291, -0.4132286608219147, -0.026366405189037323, 0.003034653142094612, -0.5275223851203918, -0.018167126923799515, 0.23878540098667145, -0.6077089905738831, 0.5368344187736511, -0.1210874393582344, 0.26415619254112244, -0.3066694438457489, 0.1471938043832779, 0.04954215884208679, 0.2045321762561798, 0.1391817331314087, 0.5286830067634583, 0.5764685273170471, 0.1882934868335724, -0.30167853832244873, -0.2122340053319931, -0.45651525259017944, -0.016777794808149338, 0.45624101161956787, -0.0438646525144577, -0.992512047290802, -0.3771328926086426, 0.04916151612997055, -0.5830298066139221, -0.01255014631897211, 0.21600870788097382, -0.18419665098190308, 0.1754663586616516, -0.1499166339635849, -0.1916201263666153, -0.22884036600589752, 0.17280352115631104, 0.25274306535720825, 0.3511175513267517, -0.20270302891731262, -0.6383468508720398, 0.43260180950164795, -0.21136239171028137, -0.05920517444610596, 0.7145522832870483, 0.7626600861549377, -0.5473887920379639, 0.4523043632507324, -0.1723199188709259, -0.10209759324789047, -0.5577948093414307, -0.10156919807195663, 0.31126976013183594, 0.3604489266872406, -0.13295558094978333, 0.2473849356174469, 0.278846800327301, -0.28618067502975464, 0.00527254119515419]
');

Please note – although we have created a vector of lowercase words representing the software industry (software), industries.name the column must correspond to the industry name (software) written in capital letters.

After all the generated word vectors have been added to industries table, we can change the focus to introducing a RESTful API.

Introduce Spring Boot Service

This was the point where my passion as a software engineer kicked into high gear, as I had everything ready to tackle the challenge at hand.

Then, using Spring Boot 3.2.2 and Java (temurin) 17, I created similarity-search-sfdc project in IntelliJ IDEA with the following Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>com.pgvector</groupId>
        <artifactId>pgvector</artifactId>
        <version>0.1.4</version>
    </dependency>
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

I created simplified entities for both Account object i Industry (embedding) object, which is aligned with the Postgres database tables created earlier.

@AllArgsConstructor
@NoArgsConstructor
@Data
@Entity
@Table(name = "account", schema = "salesforce")
public class Account 
    @Id
    @Column(name = "sfid")
    private String id;
    private String name;
    private String industry;


@AllArgsConstructor
@NoArgsConstructor
@Data
@Entity
@Table(name = "industries", schema = "salesforce")
public class Industry 
    @Id
    private String name;

Using the JpaRepository interface, I added the following extensions to allow easy access to Postgres tables:

public interface AccountsRepository extends JpaRepository<Account, String> 
    @Query(nativeQuery = true, value = "SELECT sfid, name, industry " +
            "FROM salesforce.account " +
            "WHERE industry IN (SELECT name " +
            "  FROM salesforce.industries " +
            "  WHERE name != :industry " +
            "  ORDER BY embeddings <-> (SELECT embeddings 
                  FROM salesforce.industries 
                  WHERE name = :industry) " + 
            "  LIMIT :limit)" +
            "ORDER BY name")
    Set<Account> findSimilaritiesForIndustry(String industry, int limit);


public interface IndustriesRepository extends JpaRepository<Industry, String>  

Notice findSimilaritiesForIndustry() The method is where all the hard work will take place to solve this use case. The method will accept the following parameters:

  • industry: an industry for which similarities should be found
  • limit: the maximum number of industry similarities to search against when searching for accounts

Note the Euclidean distance operator (<->) in our query above. This is a built-in extension operator to perform a similarity search.

With the original use case of “Software” in industry and the limit set to the three closest industries, the query executed would look like this:

SELECT sfid, name, industry 
FROM salesforce.account 
WHERE industry 
  IN (SELECT name 
      FROM salesforce.industries 
      WHERE name != 'Software' 
      ORDER BY embeddings 
        <-> (SELECT embeddings 
             FROM salesforce.industries 
             WHERE name="Software") 
      LIMIT 3)
ORDER BY name;

I built from there AccountsService class to interact with JPA repositories:

@RequiredArgsConstructor
@Service
public class AccountsService 
    private final AccountsRepository accountsRepository;
    private final IndustriesRepository industriesRepository;

    public Set<Account> getAccountsBySimilarIndustry(String industry, 
                                                     int limit) 
                                                     throws Exception 
        List<Industry> industries = industriesRepository.findAll();

        if (industries
                     .stream()
                     .map(Industry::getName)
                     .anyMatch(industry::equals)) 
            return accountsRepository
                      .findSimilaritiesForIndustry(industry, limit);
         else 
            throw new Exception(
               "Could not locate '" + industry + "' industry");
        
    

In the end, I had AccountsController classes to provide a RESTful entry point and connect to AccountsService:

@RequiredArgsConstructor
@RestController
@RequestMapping(value = "/accounts")
public class AccountsController 
    private final AccountsService accountsService;

    @GetMapping(value = "/similarities")
    public ResponseEntity<Set<Account>> getAccountsBySimilarIndustry(@RequestParam String industry, @RequestParam int limit) 
        try 
            return new ResponseEntity<>(
                  accountsService
                      .getAccountsBySimilarIndustry(industry, limit),
                  HttpStatus.OK);
         catch (Exception e) 
            return new ResponseEntity<>(HttpStatus.NOT_FOUND);
        
    

Deploy to Heroku

With the Spring Boot service ready, I added the following Procfile project, letting Heroku learn more about our service:

web: java $JAVA_OPTS -Dserver.port=$PORT -jar target/*.jar

Just to be sure, I added system.properties to specify which versions of Java and Maven are expected:

java.runtime.version=17
maven.version=3.9.5

Using the Heroku CLI, I added the remote to my GitLab repository for similarity-search-sfdc service for the Heroku platform:

heroku git:remote -a similarity-search-sfdc

I also set the buildpack type for similarity-search-sfdc service via the following command:

heroku buildpacks:set https://github.com/heroku/heroku-buildpack-java  

I finally deployed similarity-search-sfdc service to Heroku using the following command:

Now, Resources card for similarity-search-sfdc the application appeared as shown below:

Looking for similarities in action

With the RESTful API running, I issued the following cURL command to locate the top three Salesforce industries (and associated accounts) that are closest to Software industry:

curl --location 'https://HEROKU-APP-ROOT-URL/accounts/similarities?industry=Software&limit=3'

The RESTful API returns a 200 OK HTTP response status along with the following content:

[
    
        "id": "001Kd00001bsP80IAE",
        "name": "CleanSlate Technology Group",
        "industry": "Technology"
    ,
    
        "id": "001Kd00001bsPBFIA2",
        "name": "CMG Worldwide",
        "industry": "Media"
    ,
    
        "id": "001Kd00001bsP8AIAU",
        "name": "Dev Spotlight",
        "industry": "Technology"
    ,
    
        "id": "001Kd00001bsP8hIAE",
        "name": "Egghead",
        "industry": "Electronics"
    ,
    
        "id": "001Kd00001bsP85IAE",
        "name": "Marqeta",
        "industry": "Technology"
    
]

As a result, Technology, Mediaand Electronics industries are the closest industries Software industry in this example.

Now the marketing department has a list of accounts they can contact for their next campaign.

Conclusion

Many years ago I spent more time than I care to admit playing the multiplayer video game Team Fortress 2. Here’s a screenshot from the 2012 event, which was a lot of fun:

Those familiar with this aspect of my life could tell you that my default choice of player class was a soldier. This is because the soldier has the best balance of health, movement, speed and firepower.

I feel that software engineers are the “soldier class” of the real world because we can adapt to any situation and focus on delivering solutions that meet expectations in an efficient manner.

For several years now, I have focused on the following mission statement, which I believe can be applied to any IT professional:

“Focus your time on providing features/functionality that increase the value of your intellectual property. Use frameworks, products and services for everything else.”

– J. Vester

In the example for this post, we were able to use Heroku Connect to synchronize business data with a Postgres database. After installing the pgvector extension, we created word vectors for each unique industry from those Salesforce accounts. Finally, we introduced the Spring Boot service, which simplified the process of locating Salesforce accounts whose industry was closest to another industry.

We quickly solved this use case with existing open source technologies, adding a tiny Spring Boot service and Heroku PaaS — fully adhering to my mission statement. I can’t imagine how much time it would take without these frameworks, products and services.

If you are interested, the original source code for this article can be found on GitLab.

Have a nice day!

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *