Software engineers occupy an exciting place in this world. Regardless of technology or industry, our task is to solve problems that directly contribute to the goals of our employers. As a bonus, we get the ability to use technology to mitigate any challenges that come our way.
For this example, I wanted to focus on how pgvector — an open source vector similarity search for Postgres — can be used to identify data similarities that exist in enterprise data.
A simple use case
As a simple example, let’s say the marketing department needs help with a campaign they plan to launch. The goal is to reach all Salesforce accounts that are in industries closely related to the software industry.
Ultimately, they would like to focus on accounts in the top three most similar industries, with the possibility of using this tool in the future to find similarities for other industries. If possible, they would like the option to provide the desired number of matching industries, rather than always returning the top three.
High level design
This use case focuses on performing a similarity search. Although it is possible to complete this exercise manually, the Wikipedia2Vec tool comes to mind because of the pre-trained inserts already created for multiple languages. Inserting words — also known as vectors — are numerical representations of words that contain their syntactic and semantic information. By representing words as vectors, we can mathematically determine which words are semantically “closer” to others.
In our example, we could also write a simple Python program to create word vectors for each industry configured in Salesforce.
The pgvector
the extension requires a Postgres database. However, the business data for our example currently resides in Salesforce. Fortunately, Heroku Connect provides an easy way to sync Salesforce accounts with Heroku Postgres, storing them in a table called salesforce.account
. Then we’ll call another table salesforce.industries
containing each industry in Salesforce (as a VARCHAR key), along with its associated vector of words.
With Salesforce data and word vectors in Postgres, we will create a RESTful API using Java and Spring Boot. This service will perform the necessary queries, returning the results in JSON format.
We can illustrate a high-level view of the solution like this:
The source code will be in GitLab. Issue a git push heroku
command will trigger the deployment in Heroku, introducing a RESTful API that the marketing team can easily use.
Creation of a solution
With the high-level design in place, we can start building the solution. Using my Salesforce login, I was able to get to Accounts screen to display the data for this exercise. Here’s an example of a company’s first page:
Create a Heroku application
For this effort, I planned to use Heroku to handle the marketing team’s requests. I logged into my Heroku account and used Create a new application the button for establishing a new application, the so-called similarity-search-sfdc
:
After creating the application, I went to the Resources tab to find the Heroku Postgres plugin. I typed “Postgres” into the plugins search field.
After selecting Heroku Postgres I chose from the list Default 0 plan, but pgvector is available in standard (or higher) tier database offerings running PostgreSQL 15 or a base tier beta database.
When I verified the plugin, Heroku generated and served it DATABASE_URL
string of connections. I found this in Configuration variants section of settings my app tab. I used this information to connect to my database and enabled the pgvector extension like this:
Then I searched and found Heroku Connect addition. I knew this would give me an easy way to connect to the company’s data in Salesforce.
For this exercise, free Demo edition the plan works just fine.
In this moment, Resources card for similarity-search-sfdc
the application looked like this:
I followed the “Setting Up Heroku Connect” instructions to connect my Salesforce account to Heroku Connect. Then I chose Account synchronization object. Once complete, I could see the same Salesforce account data in Heroku Connect and in the underlying Postgres database.
From a SQL perspective, what I did resulted in the creation of a salesforce.account
table with the following design:
create table salesforce.account
(
createddate timestamp,
isdeleted boolean,
name varchar(255),
systemmodstamp timestamp,
accountnumber varchar(40),
industry varchar(255),
sfid varchar(18),
id serial
primary key,
_hc_lastop varchar(32),
_hc_err text
);
Generate vectors
In order for the similarity search to work as expected, I had to generate word vectors for each Salesforce account industry:
- Apparel
- Banking
- Biotechnology
- Construction
- Education
- Electronics
- Engineering
- Fun
- Food and drink
- Finances
- Government
- health care
- Hospitality
- Insurance
- Media
- Not for profit
- the rest
- Recreation
- Retail
- delivery
- Technology
- Telecommunications
- Transport
- Utilities
Since the primary use case indicated the need to find similarities for the software industry, we should generate a word vector for that industry as well.
To keep things simple for this exercise, I manually performed this task using Python 3.9 and a file named embed.py
which looks like this:
from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load('enwiki_20180420_100d.pkl')
print(wiki2vec.get_word_vector('software').tolist())
Keep in mind – get_word_vector()
the method expects industry representation in lowercase letters.
Starting python embed.py
generated the following word vector for software
word:
[-0.40402618050575256, 0.5711150765419006, -0.7885153293609619, -0.15960034728050232, -0.5692323446273804,
0.005377458408474922, -0.1315757781267166, -0.16840921342372894, 0.6626015305519104, -0.26056772470474243,
0.3681095242500305, -0.453583300113678, 0.004738557618111372, -0.4111144244670868, -0.1817493587732315,
-0.9268549680709839, 0.07973367720842361, -0.17835664749145508, -0.2949991524219513, -0.5533796548843384,
0.04348105192184448, -0.028855713084340096, -0.13867013156414032, -0.6649054884910583, 0.03129105269908905,
-0.24817068874835968, 0.05968991294503212, -0.24743635952472687, 0.20582349598407745, 0.6240783929824829,
0.3214546740055084, -0.14210252463817596, 0.3178422152996063, 0.7693028450012207, 0.2426985204219818,
-0.6515568494796753, -0.2868216037750244, 0.3189859390258789, 0.5168254971504211, 0.11008890718221664,
0.3537853956222534, -0.713259220123291, -0.4132286608219147, -0.026366405189037323, 0.003034653142094612,
-0.5275223851203918, -0.018167126923799515, 0.23878540098667145, -0.6077089905738831, 0.5368344187736511,
-0.1210874393582344, 0.26415619254112244, -0.3066694438457489, 0.1471938043832779, 0.04954215884208679,
0.2045321762561798, 0.1391817331314087, 0.5286830067634583, 0.5764685273170471, 0.1882934868335724,
-0.30167853832244873, -0.2122340053319931, -0.45651525259017944, -0.016777794808149338, 0.45624101161956787,
-0.0438646525144577, -0.992512047290802, -0.3771328926086426, 0.04916151612997055, -0.5830298066139221,
-0.01255014631897211, 0.21600870788097382, -0.18419665098190308, 0.1754663586616516, -0.1499166339635849,
-0.1916201263666153, -0.22884036600589752, 0.17280352115631104, 0.25274306535720825, 0.3511175513267517,
-0.20270302891731262, -0.6383468508720398, 0.43260180950164795, -0.21136239171028137, -0.05920517444610596,
0.7145522832870483, 0.7626600861549377, -0.5473887920379639, 0.4523043632507324, -0.1723199188709259,
-0.10209759324789047, -0.5577948093414307, -0.10156919807195663, 0.31126976013183594, 0.3604489266872406,
-0.13295558094978333, 0.2473849356174469, 0.278846800327301, -0.28618067502975464, 0.00527254119515419]
Create a table for industries
In order to store word vectors, we had to add industries
table into the Postgres database using the following SQL command:
create table salesforce.industries
(
name varchar not null constraint industries_pk primary key,
embeddings vector(100) not null
);
with industries
table created, we will insert each of the generated word vectors. We do this using SQL statements similar to the following:
INSERT INTO salesforce.industries
(name, embeddings)
VALUES
('Software','[-0.40402618050575256, 0.5711150765419006, -0.7885153293609619, -0.15960034728050232, -0.5692323446273804, 0.005377458408474922, -0.1315757781267166, -0.16840921342372894, 0.6626015305519104, -0.26056772470474243, 0.3681095242500305, -0.453583300113678, 0.004738557618111372, -0.4111144244670868, -0.1817493587732315, -0.9268549680709839, 0.07973367720842361, -0.17835664749145508, -0.2949991524219513, -0.5533796548843384, 0.04348105192184448, -0.028855713084340096, -0.13867013156414032, -0.6649054884910583, 0.03129105269908905, -0.24817068874835968, 0.05968991294503212, -0.24743635952472687, 0.20582349598407745, 0.6240783929824829, 0.3214546740055084, -0.14210252463817596, 0.3178422152996063, 0.7693028450012207, 0.2426985204219818, -0.6515568494796753, -0.2868216037750244, 0.3189859390258789, 0.5168254971504211, 0.11008890718221664, 0.3537853956222534, -0.713259220123291, -0.4132286608219147, -0.026366405189037323, 0.003034653142094612, -0.5275223851203918, -0.018167126923799515, 0.23878540098667145, -0.6077089905738831, 0.5368344187736511, -0.1210874393582344, 0.26415619254112244, -0.3066694438457489, 0.1471938043832779, 0.04954215884208679, 0.2045321762561798, 0.1391817331314087, 0.5286830067634583, 0.5764685273170471, 0.1882934868335724, -0.30167853832244873, -0.2122340053319931, -0.45651525259017944, -0.016777794808149338, 0.45624101161956787, -0.0438646525144577, -0.992512047290802, -0.3771328926086426, 0.04916151612997055, -0.5830298066139221, -0.01255014631897211, 0.21600870788097382, -0.18419665098190308, 0.1754663586616516, -0.1499166339635849, -0.1916201263666153, -0.22884036600589752, 0.17280352115631104, 0.25274306535720825, 0.3511175513267517, -0.20270302891731262, -0.6383468508720398, 0.43260180950164795, -0.21136239171028137, -0.05920517444610596, 0.7145522832870483, 0.7626600861549377, -0.5473887920379639, 0.4523043632507324, -0.1723199188709259, -0.10209759324789047, -0.5577948093414307, -0.10156919807195663, 0.31126976013183594, 0.3604489266872406, -0.13295558094978333, 0.2473849356174469, 0.278846800327301, -0.28618067502975464, 0.00527254119515419]
');
Please note – although we have created a vector of lowercase words representing the software industry (software), industries.name
the column must correspond to the industry name (software) written in capital letters.
After all the generated word vectors have been added to industries
table, we can change the focus to introducing a RESTful API.
Introduce Spring Boot Service
This was the point where my passion as a software engineer kicked into high gear, as I had everything ready to tackle the challenge at hand.
Then, using Spring Boot 3.2.2 and Java (temurin) 17, I created similarity-search-sfdc
project in IntelliJ IDEA with the following Maven dependencies:
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>com.pgvector</groupId>
<artifactId>pgvector</artifactId>
<version>0.1.4</version>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-configuration-processor</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
I created simplified entities for both Account object i Industry (embedding) object, which is aligned with the Postgres database tables created earlier.
@AllArgsConstructor
@NoArgsConstructor
@Data
@Entity
@Table(name = "account", schema = "salesforce")
public class Account
@Id
@Column(name = "sfid")
private String id;
private String name;
private String industry;
@AllArgsConstructor
@NoArgsConstructor
@Data
@Entity
@Table(name = "industries", schema = "salesforce")
public class Industry
@Id
private String name;
Using the JpaRepository interface, I added the following extensions to allow easy access to Postgres tables:
public interface AccountsRepository extends JpaRepository<Account, String>
@Query(nativeQuery = true, value = "SELECT sfid, name, industry " +
"FROM salesforce.account " +
"WHERE industry IN (SELECT name " +
" FROM salesforce.industries " +
" WHERE name != :industry " +
" ORDER BY embeddings <-> (SELECT embeddings
FROM salesforce.industries
WHERE name = :industry) " +
" LIMIT :limit)" +
"ORDER BY name")
Set<Account> findSimilaritiesForIndustry(String industry, int limit);
public interface IndustriesRepository extends JpaRepository<Industry, String>
Notice findSimilaritiesForIndustry()
The method is where all the hard work will take place to solve this use case. The method will accept the following parameters:
industry
: an industry for which similarities should be foundlimit
: the maximum number of industry similarities to search against when searching for accounts
Note the Euclidean distance operator (<->
) in our query above. This is a built-in extension operator to perform a similarity search.
With the original use case of “Software” in industry and the limit set to the three closest industries, the query executed would look like this:
SELECT sfid, name, industry
FROM salesforce.account
WHERE industry
IN (SELECT name
FROM salesforce.industries
WHERE name != 'Software'
ORDER BY embeddings
<-> (SELECT embeddings
FROM salesforce.industries
WHERE name="Software")
LIMIT 3)
ORDER BY name;
I built from there AccountsService
class to interact with JPA repositories:
@RequiredArgsConstructor
@Service
public class AccountsService
private final AccountsRepository accountsRepository;
private final IndustriesRepository industriesRepository;
public Set<Account> getAccountsBySimilarIndustry(String industry,
int limit)
throws Exception
List<Industry> industries = industriesRepository.findAll();
if (industries
.stream()
.map(Industry::getName)
.anyMatch(industry::equals))
return accountsRepository
.findSimilaritiesForIndustry(industry, limit);
else
throw new Exception(
"Could not locate '" + industry + "' industry");
In the end, I had AccountsController
classes to provide a RESTful entry point and connect to AccountsService
:
@RequiredArgsConstructor
@RestController
@RequestMapping(value = "/accounts")
public class AccountsController
private final AccountsService accountsService;
@GetMapping(value = "/similarities")
public ResponseEntity<Set<Account>> getAccountsBySimilarIndustry(@RequestParam String industry, @RequestParam int limit)
try
return new ResponseEntity<>(
accountsService
.getAccountsBySimilarIndustry(industry, limit),
HttpStatus.OK);
catch (Exception e)
return new ResponseEntity<>(HttpStatus.NOT_FOUND);
Deploy to Heroku
With the Spring Boot service ready, I added the following Procfile
project, letting Heroku learn more about our service:
web: java $JAVA_OPTS -Dserver.port=$PORT -jar target/*.jar
Just to be sure, I added system.properties
to specify which versions of Java and Maven are expected:
java.runtime.version=17
maven.version=3.9.5
Using the Heroku CLI, I added the remote to my GitLab repository for similarity-search-sfdc
service for the Heroku platform:
heroku git:remote -a similarity-search-sfdc
I also set the buildpack type for similarity-search-sfdc
service via the following command:
heroku buildpacks:set https://github.com/heroku/heroku-buildpack-java
I finally deployed similarity-search-sfdc
service to Heroku using the following command:
Now, Resources card for similarity-search-sfdc
the application appeared as shown below:
Looking for similarities in action
With the RESTful API running, I issued the following cURL command to locate the top three Salesforce industries (and associated accounts) that are closest to Software industry:
curl --location 'https://HEROKU-APP-ROOT-URL/accounts/similarities?industry=Software&limit=3'
The RESTful API returns a 200 OK HTTP
response status along with the following content:
[
"id": "001Kd00001bsP80IAE",
"name": "CleanSlate Technology Group",
"industry": "Technology"
,
"id": "001Kd00001bsPBFIA2",
"name": "CMG Worldwide",
"industry": "Media"
,
"id": "001Kd00001bsP8AIAU",
"name": "Dev Spotlight",
"industry": "Technology"
,
"id": "001Kd00001bsP8hIAE",
"name": "Egghead",
"industry": "Electronics"
,
"id": "001Kd00001bsP85IAE",
"name": "Marqeta",
"industry": "Technology"
]
As a result, Technology, Mediaand Electronics industries are the closest industries Software industry in this example.
Now the marketing department has a list of accounts they can contact for their next campaign.
Conclusion
Many years ago I spent more time than I care to admit playing the multiplayer video game Team Fortress 2. Here’s a screenshot from the 2012 event, which was a lot of fun:
Those familiar with this aspect of my life could tell you that my default choice of player class was a soldier. This is because the soldier has the best balance of health, movement, speed and firepower.
I feel that software engineers are the “soldier class” of the real world because we can adapt to any situation and focus on delivering solutions that meet expectations in an efficient manner.
For several years now, I have focused on the following mission statement, which I believe can be applied to any IT professional:
“Focus your time on providing features/functionality that increase the value of your intellectual property. Use frameworks, products and services for everything else.”
– J. Vester
In the example for this post, we were able to use Heroku Connect to synchronize business data with a Postgres database. After installing the pgvector extension, we created word vectors for each unique industry from those Salesforce accounts. Finally, we introduced the Spring Boot service, which simplified the process of locating Salesforce accounts whose industry was closest to another industry.
We quickly solved this use case with existing open source technologies, adding a tiny Spring Boot service and Heroku PaaS — fully adhering to my mission statement. I can’t imagine how much time it would take without these frameworks, products and services.
If you are interested, the original source code for this article can be found on GitLab.
Have a nice day!