In today’s digital age, professionals in all industries need to stay up to date with upcoming events, conferences and workshops. However, effectively finding events that align with one’s interests in the vast ocean of Internet information presents a significant challenge.
This blog presents an innovative solution to this challenge: a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale. Although MyScale is typically associated with the RAG technology suite or used as a vector database, its capabilities extend beyond these areas. We will use it for data analysis, leveraging its vector search function to analyze events that are semantically similar, thus providing better results and insights.
You may notice that Grok AI used the Qdrant vector database as a search engine to retrieve real-time information from X (formerly known as Twitter) data. You can also evaluate the power of vector databases in this way with MyScale by integrating MyScale with other platforms such as Apify to improve daily life tasks through the development of simple personalized applications.
So in this blog, let’s develop an application that takes only the city name as input and scrapes all related events from Facebook. After that, we will perform data analysis and semantic search using MyScale’s advanced SQL vector capabilities.
We will be using several tools to develop this useful application, including Apify, MyScale and OpenAI.
- Apify: A popular web scraping and automation platform that greatly simplifies the data collection process. It provides data scraping and post-entry capability to LLMs. This allows us to train LLMs in real-time data and application development.
- MyScale: MyScale is a SQL vector database that we use to store and process structured and unstructured data in an optimized way.
- OpenAI: We will use the model
text-embedding-3-small
from OpenAI to get text embeddings and then save those embeddings to MyScale for data analysis and semantic search.
How to set up MyScale and Apify
To get started with setting up MyScale and Apify, you’ll need to create a new directory and Python file. You can do this by opening a terminal or command prompt and entering the following commands:
Note: We will work in a Python notebook. Think of each block of code as a notebook cell.
How to Scrape Data Using Apify
We will now use the Apify API to scrape data about events in New York Facebook Event Scraper.
Note: Don’t forget to add your Apify API key in the above script. You can find your API token on the Integrations page of the Apify console.
Data preprocessing
When we collect raw data, it comes in different formats. In this script we will bring the event dates into one format to make our data filtering more efficient.
Generating embeddings
To better understand and search events, we will generate embeddings from their descriptions using text-embedding-3-small
. These embeds capture the semantic essence of each event, helping the application return better results.
Connecting to MyScale
As explained at the beginning, we will use MyScale as a vector database for data storage and management. Here we will connect to MyScale in preparation for data storage.
Note: See Connection Details for more information on how to connect to a MyScale cluster.
Create tables and indexes using MyScale
Now we create a table according to our DataFrame. All data will be stored in this table, including embeddings.
Storing data and creating indexes in MyScale
In this step, we insert the processed data into MyScale. This includes bulk data insertion to ensure efficient storage and retrieval.
Data analysis using MyScale
Finally, we use the analytical capabilities of MyScale to perform analysis and enable semantic search. By executing SQL queries, we can analyze events based on topics, locations, and dates. So let’s try to write some queries.
A simple SQL query
Let’s try to get it first top 10 results from the table.
Discover events by semantic relevance
Let’s try to find it top 10 upcoming events with a similar atmosphere to a reference event, such as this one: “One of the longest fairs in the country – Held since 1974 … NOW our 50TH YEAR !!! Our Schenectady”. This is achieved by comparing the semantic embeddings of event descriptions, ensuring congruence in themes and emotions.
This query ranks top 10 events by number of visitors and interested users, highlighting popular events from major city festivals to major conferences. It is ideal for those who want to join large, energetic gatherings.
By combining relevance and popularity, this query identifies similar events in New York refers to a specific event and ranks them by attendance, offering a curated list of events that reflect the city’s vibrant culture and attract local interest.
This query ranks Top 10 Event Organizers by the total number of visitors and interested users, highlighting those who excel in creating compelling events and attracting a large audience. It provides insights for event planners and visitors interested in top-level events.
We previously explored MyScale for data analytics, highlighting its capabilities in improving our data workflows. Moving forward, we will go one step further by implementing Retrieval-Augmented Generation (RAG), an innovative framework that combines an external knowledge base with LLMs. This step will help you better understand your data and find more detailed insights. Next, you’ll see how to use RAG with MyScale, which will make working with data more interesting and productive.
Conclusion
We explored the capabilities and functionalities of MyScale with Apify Scraper through the process of developing an event analytics application. MyScale has demonstrated its exceptional capabilities in high-performance vector searches while retaining all the functionality of SQL databases, which helps developers perform semantic searches using familiar SQL syntax with much greater speed and accuracy.
The capabilities of MyScale are not limited to this application: you can adopt it to develop any AI application using the RAG method.
If you have feedback or suggestions, please contact us.