Notes from attending the Data for AI meeting in Beijing

2024-9-21 | 2024-9-23

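On 2024-09-21 I attended a Data for AI meeting in Beijing. I got a lot out of it, so here is a brief record of my notes and thoughts from the event.

First of all, thanks to the organizer, Datastrato!

Some takeaways from the day first. First, if you want a product to move fast today, open source is an important lever. It is also a reminder to myself to take part in more open source projects, not with a purely utilitarian mindset, but because more and more companies building new things are embracing the community and open source. Second, the meeting focused on data. Of the three pillars of the AI era (compute, algorithms, and data), whoever gets hold of one and solves some real pain points may end up with a revolutionary product. Third, I need to pay attention to RAG and spend time learning it properly, because it is like turning on a cheat in a game: once you know the cheat exists, the experience will probably be better.

The meeting had many sessions; I will go through the key points following my own notes.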

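1. 90% of data is unstructured data

This may have been the first time I heard both the figure and the term. As I understand it, from the most basic database point of view, unstructured data cannot be put into a SQL database as rows and columns the way traditional data can. Its value may be that it is closer to how people naturally talk.

For social-platform apps, what gets collected is undoubtedly mostly unstructured data. What matters is how to mine information from it.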

2. A few concepts: metadata, catalog, data lake

metadata → provides data about data

Metadata and data are intertwined:

Metadata provides the necessary context and rules for the data stored in the database.

In Databases:

Data is stored in user tables.
Metadata is stored in system catalogs or data dictionaries.

Roles:

Metadata:

Defines the structure of the database.
Enables data integrity, optimization, and security.

Data:

The actual information you store, retrieve, and manipulate.
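
To make the data/metadata split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table and its rows are made up for illustration. In SQLite, the system catalog is the `sqlite_master` table, and `PRAGMA table_info` reports column names and declared types.

```python
import sqlite3

# In-memory database with one hypothetical user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

# Data: the rows stored in the user table.
data_rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()

# Metadata: the system catalog lists which tables exist.
catalog_rows = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# Metadata: column names and declared types of the table.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(users)")]

print(data_rows)
print(catalog_rows)
print(columns)
```

The first query returns what I stored; the other two return descriptions of how it is stored, which is the "data about data" part.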

catalog → data discovery + metadata management + data governance

data lake: a centralized repository for structured and unstructured data, kept in raw form.

The biggest difference from a data warehouse: a warehouse has a predefined schema and stores structured data.

3. Difficulties in unifying authorization in terms of metadata

1. Metadata involves privacy. 2. Policy reasons. 3. Data is stored in different ways.

4. Vector database in LLM

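First, what is a vector database?

Suppose you have a database of millions of images, each converted into a 128-dimensional vector.

Vector data:

Metadata:

ID: the image ID for each vector, such as image_001, image_002, and so on.
Label: such as "cat", "dog", or "landscape".
Other attributes: upload date, author information, etc.

Database storage:

Record format: each record contains a vector and its metadata.
Index building: the database builds an index over the vector data to support fast similarity search.

Query process:

When you want to find images similar to a given image:

Input: provide the query image and convert it into a 128-dimensional vector.

Similarity computation:

Index lookup: the database uses the index to quickly locate vectors that are likely to be similar.

Distance computation: compute the distance or similarity between the query vector and the candidate vectors (e.g. Euclidean distance, cosine similarity).

Return results: sort by similarity and return the most similar images along with their metadata.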
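
The query process above can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the vectors are 3-dimensional instead of 128, the records are invented, and the search is brute force over every record, whereas a real vector database would use an index to narrow down candidates first.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Each record pairs a vector with its metadata (toy values, assumed for illustration).
records = [
    {"id": "image_001", "label": "cat", "vector": [0.9, 0.1, 0.0]},
    {"id": "image_002", "label": "dog", "vector": [0.8, 0.3, 0.1]},
    {"id": "image_003", "label": "landscape", "vector": [0.0, 0.2, 0.9]},
]

def search(query_vector, records, top_k=2):
    """Score every record against the query and return the top_k most similar."""
    scored = [(cosine_similarity(query_vector, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(r["id"], r["label"], round(score, 3)) for score, r in scored[:top_k]]

# The query image is assumed to have already been converted to a vector.
results = search([1.0, 0.1, 0.0], records)
print(results)
```

With these toy values the query sits closest to image_001, then image_002, and the landscape image is left out.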

The concept of embedding:

Embeddings are dense numerical representations of data where similar items are positioned close to each other in a high-dimensional space.

For example, words like "king" and "queen" have embeddings that are close together in the vector space.
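
A quick way to see what "close together" means is to compare cosine similarities. The vectors below are hand-picked toy values, not output from any real embedding model; they are only there to show the comparison itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_banana = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

# Related words score much higher than unrelated ones.
print(sim_king_queen > sim_king_banana)
```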

Imagine building a search engine for an e-commerce platform:

Traditional Search Limitations:

Keyword-based searches may not capture user intent or handle synonyms well.

With Vector Databases and Embeddings:

Step 1: Use a language model to generate embeddings for product descriptions.
Step 2: Store these embeddings in a vector database.
Step 3: When a user searches for "running shoes," convert the query into an embedding.
Step 4: Perform a similarity search to find products whose embeddings are close to the query embedding.
Result: The user receives relevant product recommendations, even if exact keywords don't match.
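
The four steps can be sketched as below. To keep it runnable without a model, `embed` is a stand-in I made up: it maps words to a handful of hand-written "concept" dimensions so that synonyms land on the same axis. A real system would call a language model here, and the product names are invented.

```python
import math

# Hypothetical concept lexicon standing in for a real embedding model.
CONCEPTS = {
    "running": 0, "jogging": 0,
    "shoes": 1, "sneakers": 1, "trainers": 1,
    "coffee": 2, "espresso": 2,
    "maker": 3, "machine": 3,
}
DIM = 4

def embed(text):
    """Toy embedding: count how often each concept appears in the text."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = CONCEPTS.get(word)
        if idx is not None:
            vec[idx] += 1.0
    return vec

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Steps 1-2: embed the product descriptions and keep them in an in-memory "index".
products = [
    "Lightweight jogging sneakers",
    "Espresso machine with grinder",
    "Trail trainers for jogging",
]
index = [(name, embed(name)) for name in products]

def search(query, top_k=2):
    """Steps 3-4: embed the query, then rank products by similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

hits = search("running shoes")
print(hits)
```

Neither hit contains the words "running" or "shoes", which is the point: the match happens in the embedding space rather than on keywords.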

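5. RAG

It draws on external resources, which is like plugging in a cheat to solve problems.

How RAG works

The steps are as follows:

Input processing:

The user provides a query or question.

Vectorization:

Convert the input into an embedding vector that captures its semantic information.

Retrieve relevant documents:

In the vector database, use the input vector to retrieve documents or information similar to it.

The retrieval result is usually the handful of items most relevant to the input.

Generate the response:

Feed the input together with the retrieved documents into the generative model.

The model generates the final answer or content based on this information.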
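
Here is a minimal sketch of that flow so I can see the moving parts. Everything in it is a simplifying assumption: `embed` is a bag-of-words count rather than a real embedding model, the knowledge base is three sentences taken from my notes above, and `generate` is a placeholder where a call to a generative model would go.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy vectorization: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# External knowledge base: lives outside the model parameters.
knowledge_base = [
    "A data lake is a centralized repository that keeps structured and unstructured data in raw form.",
    "A data warehouse stores structured data under a predefined schema.",
    "A vector database indexes embeddings to support fast similarity search.",
]

def retrieve(query, top_k=2):
    """Rank knowledge-base entries by similarity to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, docs):
    """Combine the user input with the retrieved documents."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    # Placeholder: a real system would send the prompt to a generative model.
    return f"[model output for a prompt of {len(prompt)} characters]"

query = "What is a data lake?"
docs = retrieve(query, top_k=1)
prompt = build_prompt(query, docs)
answer = generate(prompt)
print(prompt)
```

The model only ever sees the prompt, so updating the knowledge base changes the answers without retraining anything.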

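The knowledge base in RAG is usually a plug-in knowledge base, i.e. an external data source that is not contained in the model parameters.

The goal is to strengthen the model's generation ability and provide more accurate, richer, and more up-to-date information.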

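ColPali: a new approach for image- and text-dense content

This is something new that one of the speakers shared: ColPali.

It was the first time I had heard of it. It is very new, and it was presented as a new approach.

Other

For the other content the speakers shared, I think it is worth looking at it all together to settle on my own direction going forward.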

Author:Michael Lu
URL:https://www.exploretech.top//article/108db349-4e0e-80aa-9e91-db779c89a45e
Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!

Tags:

Reflections
