Stefano Fiorucci

anakin87

AI & ML interests

Contributing to Haystack, the LLM Framework 🏗️. NLP / LLMs.

Posts

Ok, you're finally convinced that synthetic data works... ⚗️

ššØš° š²šØš® š°ššš§š­ š­šØ š šžš§šžš«ššš­šž ššš§ š¢š§š¬š­š«š®šœš­š¢šØš§ šššš­ššš¬šžš­ šŸšØš« šŸš¢š§šž-š­š®š§š¢š§š  š¢š§ šš š„ššš§š š®ššš šž šØš­š”šžš« š­š”ššš§ š„š§š š„š¢š¬š”.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

šŸ¦ā€ā¬› š–š”ššš­ š¢š¬ šŒššš š©š¢šž?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
If you prompt an instruction-tuned model with only a pre-query template (the chat-template tokens that open a user turn), it generates a plausible user query/instruction.

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
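
The two-step loop above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for any raw text-completion call to the model (no chat template applied by the client), and the tokens used to open the assistant turn follow the Llama 3 chat format.

```python
from typing import Callable, Dict, List

# Pre-query template from the example above (a user turn left open).
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
# Llama 3 chat-format tokens that close the user turn and open the assistant turn.
ASSISTANT_HEADER = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"


def magpie_pair(generate: Callable[[str], str], pre_query: str = PRE_QUERY) -> Dict[str, str]:
    """Produce one synthetic (instruction, response) pair.

    `generate` is a hypothetical stand-in for a raw completion call:
    it takes a prompt string and returns only the newly generated text.
    """
    # Step 1: the model completes the open user turn, inventing an instruction.
    instruction = generate(pre_query).strip()
    # Step 2: feed the instruction back to the same model to get the response.
    prompt = pre_query + "\n\n" + instruction + ASSISTANT_HEADER
    response = generate(prompt).strip()
    return {"instruction": instruction, "response": response}


def magpie_dataset(
    generate: Callable[[str], str], n_samples: int, pre_query: str = PRE_QUERY
) -> List[Dict[str, str]]:
    """Repeat the process; sampling randomness inside `generate` provides variety."""
    return [magpie_pair(generate, pre_query) for _ in range(n_samples)]
```

In practice `generate` would wrap whatever inference backend you use, with sampling enabled so that repeated calls on the same pre-query template yield different instructions.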

🪄 The authors demonstrate that using these datasets for Supervised Fine-Tuning (SFT) can yield strong performance, even competitive with the original instruct model.


🧗 Generating non-English data

Most language models are trained primarily on English text, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".
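
As a small illustration, the language-specific template can be built by appending a language tag to the base pre-query template. The helper name `language_pre_query` is hypothetical and only reproduces the string shown above; it is a sketch, not code from the article.

```python
# Base pre-query template from the Llama-3-8B-Instruct example.
BASE_PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"


def language_pre_query(language: str, base: str = BASE_PRE_QUERY) -> str:
    """Append a lowercase language tag such as 'spanish:' to the template."""
    return f"{base}{language.strip().lower()}:"


print(language_pre_query("Spanish"))
```

The model then continues the text after `spanish:`, so the generated instruction is nudged toward the target language.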

This method works for Spanish and German!

āŒ Unfortunately, it does not work well for other languages (šŸ‡®šŸ‡¹, šŸ‡³šŸ‡±, ...)

👇

---

šŸ•µšŸ» š€š šžš§š­š¢šœ š‘š€š† š°š¢š­š” šŸ¦™ š‹š„ššš¦šš 3.2

I was excited to explore Llama 3.2, but as a simple 🇪🇺 EU guy, I don't have access to Meta's multimodal models 😿

🤔 So I thought: why not challenge the small 3B text model with Agentic RAG?

🎯 The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use Web search for additional context.
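
The plan above boils down to a routing decision, which can be sketched in plain Python. This is not the notebook's Haystack pipeline: `retrieve`, `web_search`, and `llm` are hypothetical callables, and the `no_answer` sentinel is an assumed convention for letting the model signal that the documents are insufficient.

```python
from typing import Callable, Dict, List

NO_ANSWER = "no_answer"  # assumed sentinel the model is asked to emit


def build_prompt(question: str, documents: List[str], allow_no_answer: bool = True) -> str:
    """Assemble a RAG prompt; optionally ask the model to admit missing context."""
    context = "\n".join(documents) if documents else "(none)"
    rule = (
        f"If the documents do not contain the answer, reply exactly '{NO_ANSWER}'.\n"
        if allow_no_answer
        else ""
    )
    return (
        "Answer the question using the documents.\n"
        f"{rule}Documents:\n{context}\nQuestion: {question}\nAnswer:"
    )


def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], List[str]],    # hypothetical knowledge-base retriever
    web_search: Callable[[str], List[str]],  # hypothetical web search returning snippets
    llm: Callable[[str], str],               # hypothetical text-generation call
) -> Dict[str, str]:
    # First attempt: answer from the knowledge base only.
    reply = llm(build_prompt(question, retrieve(question))).strip()
    if NO_ANSWER not in reply.lower():
        return {"answer": reply, "source": "knowledge_base"}
    # Fallback: the model signalled missing context, so use web results instead.
    reply = llm(build_prompt(question, web_search(question), allow_no_answer=False)).strip()
    return {"answer": reply, "source": "web"}
```

Keeping the three dependencies as plain callables makes the routing logic easy to test with stubs before wiring in a real retriever, search API, and model.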


Check out my experimental notebook here: 📓 https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/llama32_agentic_rag.ipynb


My stack:
šŸ—ļø haystack (https://haystack.deepset.ai/): open-source LLM orchestration framework
šŸ¦™ meta-llama/Llama-3.2-3B-Instruct
šŸ¦†šŸŒ free DuckDuckGo API, integrated with Haystack

✨ The results? Encouraging - a few months ago, this level of performance from a small model would've been unthinkable!
This probably reflects the impressive IFEval score of the model (comparable to Llama 3.1 8B).