	---
	license: apache-2.0
	base_model: google/flan-t5-large
	tags:
	- generated_from_trainer
	- NLPPaper_to_Question_Generation
	- Summarization
	- Long Document Summarization
	model-index:
	- name: FLAN-T5-NLP-Paper-to-Question-Generation
	results: []
	widget:
	- text: >-
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] The dominant sequence transduction models are based on complex
	recurrent or convolutional neural networks in an encoder-decoder
	configuration. The best performing models also connect the encoder and
	decoder through an attention mechanism. We propose a new simple network
	architecture, the Transformer, based solely on attention mechanisms,
	dispensing with recurrence and convolutions entirely. Experiments on two
	machine translation tasks show these models to be superior in quality while
	being more parallelizable and requiring significantly less time to train.
	Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
	task, improving over the existing best results, including ensembles by over
	2 BLEU. On the WMT 2014 English-to-French translation task, our model
	establishes a new single-model state-of-the-art BLEU score of 41.8 after
	training for 3.5 days on eight GPUs, a small fraction of the training costs
	of the best models from the literature. We show that the Transformer
	generalizes well to other tasks by applying it successfully to English
	constituency parsing both with large and limited training data.
	[Introduction] Recurrent neural networks, long short-term memory [13] and
	gated recurrent [7] neural networks in particular, have been firmly
	established as state of the art approaches in sequence modeling and
	transduction problems such as language modeling and machine translation [35,
	2, 5]. Numerous efforts have since continued to push the boundaries of
	recurrent language models and encoder-decoder architectures [38, 24, 15].
	Recurrent models typically factor computation along the symbol positions of
	the input and output sequences. Aligning the positions to steps in
	computation time, they generate a sequence of hidden states ht, as a
	function of the previous hidden state ht−1 and the input for position t.
	This inherently sequential nature precludes parallelization within training
	examples, which becomes critical at longer sequence lengths, as memory
	constraints limit batching across examples. Recent work has achieved
	significant improvements in computational efficiency through factorization
	tricks [21] and conditional computation [32], while also improving model
	performance in case of the latter. The fundamental constraint of sequential
	computation, however, remains. Attention mechanisms have become an integral
	part of compelling sequence modeling and transduction models in various
	tasks, allowing modeling of dependencies without regard to their distance in
	the input or output sequences [2, 19]. In all but a few cases [27], however,
	such attention mechanisms are used in conjunction with a recurrent network.
	In this work we propose the Transformer, a model architecture eschewing
	recurrence and instead relying entirely on an attention mechanism to draw
	global dependencies between input and output. The Transformer allows for
	significantly more parallelization and can reach a new state of the art in
	translation quality after being trained for as little as twelve hours on
	eight P100 GPUs.
	Question, Answer:
	example_title: Attention Is All You Need
	- text: >-
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] In this work, we explore prompt tuning, a simple yet effective
	mechanism for learning soft prompts to condition frozen language models to
	perform specific downstream tasks. Unlike the discrete text prompts used by
	GPT-3, soft prompts are learned through backpropagation and can be tuned to
	incorporate signal from any number of labeled examples. Our end-to-end
	learned approach outperforms GPT-3's few-shot learning by a large margin.
	More remarkably, through ablations on model size using T5, we show that
	prompt tuning becomes more competitive with scale: as models exceed billions
	of parameters, our method closes the gap and matches the strong performance
	of model tuning (where all model weights are tuned). This finding is
	especially relevant in that large models are costly to share and serve, and
	the ability to reuse one frozen model for multiple downstream tasks can ease
	this burden. Our method can be seen as a simplification of the recently
	proposed prefix tuning of Li and Liang (2021), and we provide a comparison
	to this and other similar approaches. Finally, we show that conditioning a
	frozen model with soft prompts confers benefits in robustness to domain
	transfer, as compared to full model tuning. [Introduction] With the wide
	success of pre-trained large language models, a range of techniques has
	arisen to adapt these general-purpose models to downstream tasks. ELMo
	(Peters et al., 2018) proposed freezing the pre-trained model and learning a
	task-specific weighting of its per-layer representations. However, since GPT
	(Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant
	adaptation technique has been model tuning (or fine-tuning), where all model
	parameters are tuned during adaptation, as proposed by Howard and Ruder
	(2018).More recently, Brown et al. (2020) showed that prompt design (or
	priming) is surprisingly effective at modulating a frozen GPT-3 model’s
	behavior through text prompts. Prompts are typically composed of a task
	description and/or several canonical examples. This return to freezing
	pre-trained models is appealing, especially as model size continues to
	increase. Rather than requiring a separate copy of the model for each
	downstream task, a single generalist model can simultaneously serve many
	different tasks. Unfortunately, prompt-based adaptation has several key
	drawbacks. Task description is error-prone and requires human involvement,
	and the effectiveness of a prompt is limited by how much conditioning text
	can fit into the model’s input. As a result, downstream task quality still
	lags far behind that of tuned models. For instance, GPT-3 175B fewshot
	performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
	al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
	efforts to automate prompt design have been recently proposed. Shin et al.
	(2020) propose a search algorithm over the discrete space of words, guided
	by the downstream application training data. While this technique
	outperforms manual prompt design, there is still a gap relative to model
	tuning. Li and Liang (2021) propose prefix tuning and show strong results on
	generative tasks. This method freezes the model parameters and
	backpropagates the error during tuning to prefix activations prepended to
	each layer in the encoder stack, including the input layer. Hambardzumyan et
	al. (2021) simplify this recipe by restricting the trainable parameters to
	the input and output subnetworks of a masked language model, and show
	reasonable results on classifications tasks. In this paper, we propose
	prompt tuning as a further simplification for adapting language models. We
	freeze the entire pre-trained model and only allow an additional k tunable
	tokens per downstream task to be prepended to the input text. This soft
	prompt is trained end-to-end and can condense the signal from a full labeled
	dataset, allowing our method to outperform few-shot prompts and close the
	quality gap with model tuning (Figure 1). At the same time, since a single
	pre-trained model is recycled for all downstream tasks, we retain the
	efficient serving benefits of frozen models (Figure 2). While we developed
	our method concurrently with Li and Liang (2021) and Hambardzumyan et al.
	(2021), we are the first to show that prompt tuning alone (with no
	intermediate-layer prefixes or task-specific output layers) is sufficient to
	be competitive with model tuning. Through detailed experiments in sections
	2–3, we demonstrate that language model capacity is a key ingredient for
	these approaches to succeed. As Figure 1 shows, prompt tuning becomes more
	competitive with scale. We compare with similar approaches in Section 4.
	Explicitly separating task-specific parameters from the generalist
	parameters needed for general language-understanding has a range of
	additional benefits. We show in Section 5 that by capturing the task
	definition in the prompt while keeping the generalist parameters fixed, we
	are able to achieve better resilience to domain shifts. In Section 6, we
	show that prompt ensembling, learning multiple prompts for the same task,
	can boost quality and is more efficient than classic model ensembling.
	Finally, in Section 7, we investigate the interpretability of our learned
	soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning
	and showing its competitiveness with model tuning in the regime of large
	language models. 2. Ablating many design choices, and showing quality and
	robustness improve with scale. 3. Showing prompt tuning outperforms model
	tuning on domain shift problems. 4. Proposing prompt ensembling and showing
	its effectiveness.
	Question, Answer:
	example_title: PEFT (2104.08691)
	- text: >-
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] For the first time in the world, we succeeded in synthesizing the
	room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient
	pressure with a modified lead-apatite (LK-99) structure. The
	superconductivity of LK-99 is proved with the Critical temperature (Tc),
	Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
	the Meissner effect. The superconductivity of LK-99 originates from minute
	structural distortion by a slight volume shrinkage (0.48 %), not by external
	factors such as temperature and pressure. The shrinkage is caused by Cu2+
	substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate
	and it generates the stress. It concurrently transfers to Pb(1) of the
	cylindrical column resulting in distortion of the cylindrical column
	interface, which creates superconducting quantum wells (SQWs) in the
	interface. The heat capacity results indicated that the new model is
	suitable for explaining the superconductivity of LK-99. The unique structure
	of LK-99 that allows the minute distorted structure to be maintained in the
	interfaces is the most important factor that LK-99 maintains and exhibits
	superconductivity at room temperatures and ambient pressure. [Introduction]
	Since the discovery of the first superconductor(1), many efforts to search
	for new room-temperature superconductors have been carried out worldwide(2,
	3) through their experimental clarity or/and theoretical perspectives(4-8).
	The recent success of developing room-temperature superconductors with
	hydrogen sulfide(9) and yttrium super-hydride(10) has great attention
	worldwide, which is expected by strong electron-phonon coupling theory with
	high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
	apply them to actual application devices in daily life because of the
	tremendously high pressure, and more efforts are being made to overcome the
	high-pressure problem(13). For the first time in the world, we report the
	success in synthesizing a room-temperature and ambient-pressure
	superconductor with a chemical approach to solve the temperature and
	pressure problem. We named the first room temperature and ambient pressure
	superconductor LK-99. The superconductivity of LK-99 proved with the
	Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical
	magnetic field (Hc), and Meissner effect(14, 15). Several data were
	collected and analyzed in detail to figure out the puzzle of
	superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
	spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat
	Capacity, and Superconducting quantum interference device (SQUID) data.
	Henceforth in this paper, we will report and discuss our new findings
	including superconducting quantum wells associated with the
	superconductivity of LK-99.
	Question, Answer:
	example_title: LK-99 (Not NLP)
	- text: >-
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] Evaluation practices in natural language generation
	(NLG) have many known flaws, but improved evaluation approaches are rarely
	widely adopted. This issue has become more urgent, since neural NLG models
	have improved to the point where they can often no longer be distinguished
	based on the surface-level features that older metrics rely on. This paper
	surveys the issues with human and automatic model evaluations and with
	commonly used datasets in NLG that have been pointed out over the past 20
	years. We summarize, categorize, and discuss how researchers have been
	addressing these issues and what their findings mean for the current state
	of model evaluations. Building on those insights, we lay out a long-term
	vision for NLG evaluation and propose concrete steps for researchers to
	improve their evaluation processes. Finally, we analyze 66 NLG papers from
	recent NLP conferences in how well they already follow these suggestions and
	identify which areas require more drastic changes to the status quo.
	[Introduction] There are many issues with the evaluation of models that
	generate natural language. For example, datasets are often constructed in a
	way that prevents measuring tail effects of robustness, and they almost
	exclusively cover English. Most automated metrics measure only similarity
	between model output and references instead of fine-grained quality aspects
	(and even that poorly). Human evaluations have a high variance and, due to
	insufficient documentation, rarely produce replicable results. These issues
	have become more urgent as the nature of models that generate language has
	changed without significant changes to how they are being evaluated. While
	evaluation methods can capture surface-level improvements in text generated
	by state-of-the-art models (such as increased fluency) to some extent, they
	are ill-suited to detect issues with the content of model outputs, for
	example if they are not attributable to input information. These ineffective
	evaluations lead to overestimates of model capabilities. Deeper analyses
	uncover that popular models fail even at simple tasks by taking shortcuts,
	overfitting, hallucinating, and not being in accordance with their
	communicative goals. Identifying these shortcomings, many recent papers
	critique evaluation techniques or propose new ones. But almost none of the
	suggestions are followed or new techniques used. There is an incentive
	mismatch between conducting high-quality evaluations and publishing new
	models or modeling techniques. While general-purpose evaluation techniques
	could lower the barrier of entry for incorporating evaluation advances into
	model development, their development requires resources that are hard to
	come by, including model outputs on validation and test sets or large
	quantities of human assessments of such outputs. Moreover, some issues, like
	the refinement of datasets, require iterative processes where many
	researchers collaborate. All this leads to a circular dependency where
	evaluations of generation models can be improved only if generation models
	use better evaluations. We find that there is a systemic difference between
	selecting the best model and characterizing how good this model really is.
	Current evaluation techniques focus on the first, while the second is
	required to detect crucial issues. More emphasis needs to be put on
	measuring and reporting model limitations, rather than focusing on producing
	the highest performance numbers. To that end, this paper surveys analyses
	and critiques of evaluation approaches (sections 3 and 4) and of commonly
	used NLG datasets (section 5). Drawing on their insights, we describe how
	researchers developing modeling techniques can help to improve and
	subsequently benefit from better evaluations with methods available today
	(section 6). Expanding on existing work on model documentation and formal
	evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
	propose releasing evaluation reports which focus on demonstrating NLG model
	shortcomings using evaluation suites. These reports should apply a
	complementary set of automatic metrics, include rigorous human evaluations,
	and be accompanied by data releases that allow for re-analysis with improved
	metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29
	dimensions related to our suggestions (section 7), we find that the first
	steps toward an improved evaluation are already frequently taken at an
	average rate of 27%. The analysis uncovers the dimensions that require more
	drastic changes in the NLG community. For example, 84% of papers already
	report results on multiple datasets and more than 28% point out issues in
	them, but we found only a single paper that contributed to the dataset
	documentation, leaving future researchers to re-identify those issues. We
	further highlight typical unsupported claims and a need for more consistent
	data release practices. Following the suggestions and results, we discuss
	how incorporating the suggestions can improve evaluation research, how the
	suggestions differ from similar ones made for NLU, and how better metrics
	can benefit model development itself (section 8).
	Question, Answer:
	example_title: NLG-Eval (2202.06935)
	- text: >-
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] Humans have harbored a longstanding desire to acquire additional abilities through
	absorption. Super Mario serves as an embodiment of this human dream, which
	can collect items to gain extra skills such as throwing fireballs and being temporarily
	invincible. In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of
	homologous models without the need for retraining or GPUs. Typically, new
	abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in
	the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters).
	We initially observe that by introducing a novel operation called DARE (Drop And
	REscale), most of the delta parameters can be directly set to zeros without affecting
	the capabilities of SFT LMs and larger models can tolerate a higher proportion
	of discarded parameters. Based on this observation, we further sparsify delta
	parameters of multiple SFT homologous models with DARE and subsequently
	merge them into a single model by parameter averaging. We conduct experiments
	on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also
	merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental
	results show that: (1) The delta parameter value ranges for SFT models are typically
	small, often within 0.005, and DARE can eliminate 99% of them effortlessly.
	However, once the models are continuously pre-trained, the value ranges can grow
	to around 0.03, making DARE impractical. We have also tried to remove fine-tuned
	instead of delta parameters and find that a 10% reduction can lead to drastically
	decreased performance (even to 0.0). This highlights that SFT merely stimulates
	the abilities via delta parameters rather than injecting new abilities into LMs; (2)
	DARE can merge multiple task-specific LMs into one LM with diverse abilities.
	For instance, the merger of WizardLM and WizardMath increases the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following
	ability while surpassing WizardMath’s original 64.2 performance. All resources
	are available at https://github.com/yule-BUAA/MergeLM.
	[Introduction] Human beings have always expressed their ambition to acquire additional abilities through various
	ways such as movies and games. For example, in X-Men’s Apocalypse, the character can absorb the
	powers of other mutants to strengthen himself. Likewise, the protagonist in the Super Mario games
	can gain superpowers like throwing fireballs by absorbing in-game items. Large Language Models
	(LLMs), such as GPT-4 [45], can reasonably be considered as early iterations of artificial general
	intelligence systems, given their performance is remarkably close to human-level capabilities. In this paper, we astonishingly find that LMs, similar to Apocalypse and Super Mario, can enhance their
	capabilities by absorbing other models without the need for training or GPUs.
	Formally, Supervised Fine-Tuning (SFT) is the most widely adopted strategy for assigning task-specific capabilities to LMs by optimizing their parameters [13, 67]. The effectiveness of SFT is
	fully evident in the alteration of the model parameters before and after SFT, referred to as delta
	parameters [12]. We initially demonstrate that SFT LM (either encoder- or decoder-based) always
	tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which
	randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the
	remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE,
	when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with
	minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the
	larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank
	structures akin to LoRA [25]. Thus, even when most of these structures are removed, resulting in a
	low-rank and extremely sparse delta parameter set, the LM can still retain its capabilities.
	Based on this observation, we can confidently merge multiple homologous SFT LMs (pre-trained
	from the same backbone) without significant concerns about the decrease in their capabilities. As
	long as a small portion of the delta parameters remains unaffected in the merging process, the abilities
	of LMs unlocked by SFT can still be preserved. We first employ DARE to eliminate redundant
	delta parameters in each model before merging, which can potentially mitigate the interference of
	parameters among multiple models [62]. Then, we apply established model merging techniques
	[59, 26, 44, 27, 62] to the parameters with reduced redundancy to create a single model with diverse
	capabilities. We conduct extensive experiments on encoder-based LMs on eight datasets from the
	GLUE benchmark, and decoder-based Llama 2 with three distinct abilities: instruction-following,
	mathematical reasoning, and code-generating. We observe that:
	(1) SFT LMs exhibit a substantial number of redundant delta parameters whether they are based on
	BERT, RoBERTa, or Llama 2. DARE allows the removal of approximately 90% or even 99% delta
	parameters without significantly affecting the performance of downstream tasks. The rescale operation
	in DARE is a crucial component to guarantee effective ablations of delta parameters. Without
	rescaling, removing only 10% delta parameters would noticeably affect performance. We attribute
	this phenomenon to the fact that rescaling helps preserve the connectivity of model parameters [46].
	(2) DARE is able to enhance the performance of most existing model merging methods when merging
	encoder-based LMs on the eight datasets from GLUE. When it comes to larger LMs based on Llama
	2, the simple parameter averaging method can already produce surprisingly good results. As shown
	in Figure 1(b), we merge WizardLM and WizardMath by combining DARE and parameter averaging,
	leading to a significant improvement of WizardLM’s mathematical reasoning ability from 2.2 to 64.2
	accuracy on GSM8K, while also modestly enhancing its instruction-following ability with win rate
	from 67.2 to 67.5 on AlpacaEval. It is worth noticing that all these benefits are achieved by solely
	using CPUs without further training. Similar improvements can also be observed when merging
	code-generating models. (3) DARE is applicable to SFT delta parameters whose value ranges are relatively small. Different
	from the observations of delta parameters, dropping only 10% fine-tuned parameters would lead to a
	catastrophic decrease in performance, even approaching zero. We also find that the delta parameters
	of SFT LMs usually stay within a range of 0.005 or less, indicating minimal modifications to the
	pre-trained LM. However, once we continue pre-training, the delta parameters can rapidly reach
	around 0.03, making DARE infeasible. This further confirms that SFT primarily unlocks the abilities
	of the pre-trained LM, rather than introducing additional abilities.
	Last but not least, we have implemented an open-sourced codebase at https://github.com/yule-BUAA/MergeLM,
	which integrates existing popular model merging methods and supports both
	encoder- and decoder-based language models. We hope this work can advance the understanding of
	how alignment works from the perspective of parameters.

	Question, Answer:
	example_title: LM-SuperMario (2311.03099)


	datasets:
	- UNIST-Eunchan/NLP-Paper-to-QA-Generation
	language:
	- en
	pipeline_tag: text2text-generation
	---


	# FLAN-T5-NLP-Paper-to-Question-Generation

	This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the [NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation) dataset, which is built from [allenai/QASPER](https://huggingface.co/datasets/allenai/qasper), a dataset for question answering on scientific research papers.

	## Target Task

	- NLP Paper's Abstract + Introduction --> {Question} [SEP] {Answer}
	- Question-based Summarization
	- Long Document Summarization
	- Scientific Paper Summarization
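
Since each generated string follows the `{Question} [SEP] {Answer}` format above, it can be split back into its two parts. The snippet below is a minimal sketch, not part of the released code; `split_qa` is a hypothetical helper name, and it assumes the literal separator `[SEP]` appears at most once per output.

```python
from typing import Optional, Tuple

SEP = "[SEP]"  # separator named in the target format above


def split_qa(generated: str) -> Optional[Tuple[str, str]]:
    """Split one generated string into (question, answer).

    Returns None when the separator is missing or either side is empty.
    """
    question, sep, answer = generated.partition(SEP)
    if not sep:
        return None
    question, answer = question.strip(), answer.strip()
    if not question or not answer:
        return None
    return question, answer
```

Returning `None` for malformed outputs lets the caller filter them out instead of failing on a sampled sequence that did not follow the format.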


	## (1) How to use: Inference on CPU (Code Snippets)
	- Inference can be slow on CPU

	### Load model directly
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
	model = AutoModelForSeq2SeqLM.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
	```

	### Prompting Input
	```python
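	# `text` is a placeholder dict holding the paper's sections; fill in your own content
	text = {"abstract": "...", "introduction": "..."}

	# An f-string is needed so that the {text[...]} placeholders are actually substituted
	txt = f"""
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] {text['abstract']} [Introduction] {text['introduction']}
	Question, Answer:
	""".replace("\n", "")

	inputs = tokenizer(txt, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")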
	```

	### For Multiple Question Generation (👍)
	```python
	num_generate_sequence = 4  # e.g. 1, 2, 8, or 16
	summaries = model.generate(input_ids=inputs["input_ids"], max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)
	```
	### For Single Question Generation
	```python
	summaries = model.generate(input_ids=inputs["input_ids"], max_new_tokens=100, do_sample=True, top_p=0.95)
	```

	```python
	# Decode with special tokens kept, then strip the pad and EOS tokens by hand
	decoded_summaries = [tokenizer.decode(s, skip_special_tokens=False, clean_up_tokenization_spaces=True) for s in summaries]
	decoded_summaries = [d.replace("<n>", " ").replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "") for d in decoded_summaries]
	```
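
Sampling several sequences at once can return the same text more than once, so it may help to drop repeats before using the decoded outputs. The following is a minimal sketch under that assumption; `unique_outputs` is a hypothetical helper, not part of the model's code, and it only removes exact duplicates after whitespace normalization.

```python
def unique_outputs(decoded):
    """Drop empty strings and exact duplicates, keeping first-seen order.

    Whitespace is normalized first so that strings differing only in
    spacing count as the same output.
    """
    seen = set()
    unique = []
    for text in decoded:
        key = " ".join(text.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

It could be applied as `unique_outputs(decoded_summaries)` once decoding has finished.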

	## (2) Faster Inference on GPU
	- About 60x faster than (1) when moving from CPU to a Colab T4 GPU

	### Additional Installation
	```python
	!pip install accelerate -q
	!pip install bitsandbytes -q
	!pip install optimum -q
	```

	### Load model directly
	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
	from optimum.bettertransformer import BetterTransformer

	# load model in 4-bit
	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_compute_dtype=torch.bfloat16,
	)

	tokenizer = AutoTokenizer.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation")
	model = AutoModelForSeq2SeqLM.from_pretrained("UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation", quantization_config=quantization_config)
	model = BetterTransformer.transform(model)
	```


	### For Multiple Question Generation (👍)
	```python
	# `inputs` is the tokenized prompt from section (1); move it to the model's device
	device = model.device

	num_generate_sequence = 16  # about 20 sec on a Colab T4 GPU
	summaries = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=100, do_sample=True, top_p=0.95, num_return_sequences=num_generate_sequence)
	```


	### Training results


	It achieves the following results on the evaluation set:
	- Loss: 0.4504

	| Training Loss | Epoch | Step | Validation Loss |
	|:-------------:|:-----:|:----:|:---------------:|
	| No log        | 0.99  | 46   | 34.6109         |
	| 29.7732       | 1.99  | 92   | 16.5236         |
	| 29.7732       | 2.98  | 138  | 4.6887          |
	| 7.9911        | 3.97  | 184  | 0.5679          |
	| 7.9911        | 4.97  | 230  | 0.4795          |
	| 0.6152        | 5.96  | 276  | 0.4577          |
	| 0.6152        | 6.95  | 322  | 0.4523          |
	| 0.4811        | 7.95  | 368  | 0.4509          |
	| 0.4811        | 8.94  | 414  | 0.4505          |
	| 0.4721        | 9.93  | 460  | 0.4504          |
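
The validation loss in the table drops sharply in the first four epochs and then flattens. As a rough illustration, the sketch below finds the first epoch at which the relative improvement falls under a threshold. The loss values are copied from the table; the helper name `first_plateau_epoch` and the 1% tolerance are assumptions made for this example, not settings used in training.

```python
# (epoch, validation loss) pairs copied from the training results table
val_loss = [
    (0.99, 34.6109), (1.99, 16.5236), (2.98, 4.6887), (3.97, 0.5679),
    (4.97, 0.4795), (5.96, 0.4577), (6.95, 0.4523), (7.95, 0.4509),
    (8.94, 0.4505), (9.93, 0.4504),
]


def first_plateau_epoch(history, rel_tol=0.01):
    """Return the first epoch whose loss improved by less than rel_tol
    (as a fraction of the previous loss), or None if no epoch qualifies."""
    for (_, prev), (epoch, cur) in zip(history, history[1:]):
        if (prev - cur) / prev < rel_tol:
            return epoch
    return None


plateau = first_plateau_epoch(val_loss)
```

With the 1% tolerance, `plateau` is 7.95: the step from 0.4523 to 0.4509 is the first improvement below 1%, which suggests the last two epochs add very little.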

	## Model description

	- FLAN-T5-Large (783M)



	### Generated Output Example
	- The model generates 16 different Q-A pairs with top-p sampling.

	```python
	input = r"""
	Generate Question, Answer pair correspond to the following research paper.
	[Abstract] In this work, we explore prompt tuning, a simple yet effective mechanism for learning soft prompts to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method closes the gap and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed prefix tuning of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning. [Introduction] With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or fine-tuning), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).More recently, Brown et al. (2020) showed that prompt design (or priming) is surprisingly effective at modulating a frozen GPT-3 model’s behavior through text prompts. 
Prompts are typically composed of a task description and/or several canonical examples. This return to freezing pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks. Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model’s input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B fewshot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning. Li and Liang (2021) propose prefix tuning and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classifications tasks. In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. 
This soft prompt is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2). While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale. We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the generalist parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that prompt ensembling, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models. 2. Ablating many design choices, and showing quality and robustness improve with scale. 3. Showing prompt tuning outperforms model tuning on domain shift problems. 4. Proposing prompt ensembling and showing its effectiveness.
	Question, Answer:
	""".replace("\n", "")

	output = [' What was the size of each untrained model?[SEP] The size of the model can be a combination of the size of all the parameters in a model',
	' What are the benefits of using soft prompts?[SEP] They reduce the need to use manual prompt design and conserve machine training data',
	' What is the sample size of dataset?[SEP] 22840',
	' How does the method outperform some of the pre-trained models?[SEP] They successfully tune their model for two tasks, one for a few shot and the other for several downstream tasks.',
	' What is the sample size of the experiments?[SEP]135 for a simple task?[SEP]32 for a more complicated task',
	' What is the baseline model they tested? [SEP] GPT-3 model, with four state-of-the-art examples in a masked language model',
	' What task accuracy is given by prompts?[SEP]Mixed task efficiency was 93% and accuracy 85% compared to normal noise level',
	' What metrics do they use?[SEP] EMO score, VSD, and SVM scores',
	' What metrics are used to assess the performance of the soft prompt training?[SEP] quality of translation, accuracy of text-to-text, robustness of domain transfer, error rate.',
	' How much do they experiment with the T5 baseline?[SEP] The baseline is used for simulated benchmarks.',
	' Which task are they applying their method to?[SEP]They test their approach on classifications tasks',
	" Why do they show that their approach outperforms GPT-3's few-shot? [SEP] This is a large project that uses a multi-task approach to train GPT-3 models. In this paper, they demonstrate that the current method outperforms both the GPT-3 few-shot and the Li and Liang prefix tuning. They also show that the prefix tuning performed much better than the model tuning. What is the difference between their experiments",
	' How do they compare with other techniques? [SEP] They provide a comparison for each approach.',
	' Which task is the GPT-3 model most applicable to?[SEP]Classification tasks. For which tasks does the model need a subnetwork?[SEP]Classification tasks for GPT-3',
	' What is the baseline test case used for this experiment?[SEP]Pompets for a variety of tasks are trained using the same method. This is the baseline, and the baseline is used for all applications.',
	' What was the size of their model?[SEP] They experimented with 0.5 m.m and 0.5 m.m respectively.']

	```
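
Each generated string above packs a question and its answer around a `[SEP]` marker. As a minimal sketch (the helper name `parse_qa` is an assumption for illustration, not part of this repository), the strings could be split into pairs like this:

```python
SEP = "[SEP]"  # separator that appears in the generated strings above


def parse_qa(generated: str) -> tuple[str, str]:
    """Split one generated string into a (question, answer) pair."""
    # Text before the first [SEP] is treated as the question,
    # everything after it as the answer.
    question, _, answer = generated.partition(SEP)
    return question.strip(), answer.strip()


# Two of the sample generations shown above.
outputs = [
    " What is the sample size of dataset?[SEP] 22840",
    " How do they compare with other techniques? [SEP] They provide a comparison for each approach.",
]

pairs = [parse_qa(o) for o in outputs]
for q, a in pairs:
    print(f"Q: {q}\nA: {a}")
```

Some generations contain more than one `[SEP]`; this sketch splits only on the first one, so any later markers stay inside the answer text and may need extra handling.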

	## Inference Examples
	```
	If the Inference API produces poor output, call model.generate() in your own code for better results.
	```

	- (1) [Attention is All You Need](https://arxiv.org/abs/1706.03762)
	- (2) [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)
	- (3) [The First Room-Temperature Ambient-Pressure Superconductor](https://arxiv.org/abs/2307.12008) (LK-99 paper; not an NLP paper)
	- (4) [Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text](https://arxiv.org/abs/2202.06935)
	- (5) [Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch](https://arxiv.org/abs/2311.03099)



	## Training and evaluation data
	- Dataset: [UNIST-Eunchan/NLP-Paper-to-QA-Generation](https://huggingface.co/datasets/UNIST-Eunchan/NLP-Paper-to-QA-Generation)
	- Train: the `train` and `test` splits combined (`dataset['train'] + dataset['test']`)
	- Evaluation: `dataset['validation']`

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0001
	- train_batch_size: 1
	- eval_batch_size: 1
	- seed: 42
	- gradient_accumulation_steps: 16
	- total_train_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 184
	- num_epochs: 10
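
To make the relationship between these values concrete, here is a minimal sketch of how the total batch size follows from the per-step batch size and gradient accumulation, and of a linear schedule with warmup. It assumes a single device and a linear warmup from zero followed by linear decay to zero. The function names are illustrative, and `total_steps` is a hypothetical value because this card does not state the total number of optimizer steps.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update (num_devices=1 is an assumption)."""
    return per_device * accumulation_steps * num_devices


def linear_warmup_decay_lr(step: int, total_steps: int,
                           base_lr: float = 1e-4, warmup_steps: int = 184) -> float:
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# train_batch_size=1 with gradient_accumulation_steps=16 gives the listed total of 16.
print(effective_batch_size(per_device=1, accumulation_steps=16))

# total_steps=1000 is a placeholder chosen only for illustration.
for s in (0, 92, 184, 592, 1000):
    print(s, linear_warmup_decay_lr(s, total_steps=1000))
```

The learning rate peaks at `learning_rate` exactly when warmup ends (step 184 here) and decreases linearly afterwards.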