{ "cells": [ { "cell_type": "markdown", "id": "883a8a6a-d0b5-40ea-90a0-5b33d3332360", "metadata": {}, "source": [ "# Get Data\n", "The Wikipedia dump is distributed as XML. This notebook is a relatively simple way to convert it into a single JSON file for our purposes." ] }, { "cell_type": "markdown", "id": "a7d66da5-185c-409e-9568-f211ca4b725e", "metadata": {}, "source": [ "## Initialize Variables" ] }, { "cell_type": "code", "execution_count": null, "id": "ea8ae64c-f597-4c94-b93d-1b78060d7953", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "import sys" ] }, { "cell_type": "code", "execution_count": null, "id": "2f9527f9-4756-478b-99ac-a3c8c26ab63e", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Project root: the parent of the directory this notebook runs in\n", "proj_dir = str(Path.cwd().parent)\n", "\n", "# Add the project root to sys.path so that `src` can be imported later\n", "sys.path.append(proj_dir)" ] }, { "cell_type": "markdown", "id": "860da614-743b-4060-9d22-673896414cbd", "metadata": {}, "source": [ "## Install Libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "8bec29e3-8434-491f-914c-13f303dc68f3", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install -q -r \"$proj_dir\"/requirements.txt" ] }, { "cell_type": "markdown", "id": "b928c71f-7e34-47ee-b55e-aa12d5118ba7", "metadata": {}, "source": [ "## Download Latest Simple Wikipedia" ] }, { "cell_type": "markdown", "id": "f1dc5f57-c877-43e3-8131-4f351b99168d", "metadata": {}, "source": [ "I'm downloading \"latest\", but it's still good to check which version that actually is." 
] }, { "cell_type": "code", "execution_count": null, "id": "fe4b357f-88fe-44b5-9fce-354404b1447f", "metadata": { "tags": [] }, "outputs": [], "source": [ "!curl -I https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2 --silent | grep \"Last-Modified\"" ] }, { "cell_type": "markdown", "id": "fe62d4a3-b59b-40c4-9a8c-bf0a447a9ec2", "metadata": {}, "source": [ "Download the Simple English Wikipedia dump." ] }, { "cell_type": "code", "execution_count": null, "id": "0f309c12-12de-4460-a03f-bd5b6fcc942c", "metadata": { "tags": [] }, "outputs": [], "source": [ "!wget -P \"$proj_dir\"/data/raw https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2" ] }, { "cell_type": "markdown", "id": "46af5df6-5785-400a-986c-54a2c98768ea", "metadata": {}, "source": [ "## Extract XML into JSONL" ] }, { "cell_type": "code", "execution_count": null, "id": "c22dedcd-73b3-4aad-8eb7-1063954967ed", "metadata": {}, "outputs": [], "source": [ "!wikiextractor -o \"$proj_dir\"/data/raw/output --json \"$proj_dir\"/data/raw/simplewiki-latest-pages-articles-multistream.xml.bz2" ] }, { "cell_type": "markdown", "id": "bb8063c6-1bed-49f0-948a-eeb9a7933b4a", "metadata": {}, "source": [ "## Consolidate into JSON\n", "\n", "This step is mostly bookkeeping: most people don't care how the data is formatted, only that it's correct. See the `src.preprocessing.consolidate` module for details." 
] }, { "cell_type": "code", "execution_count": null, "id": "0a4ce3aa-9c1e-45e4-8219-a1714f482371", "metadata": { "tags": [] }, "outputs": [], "source": [ "from src.preprocessing.consolidate import folder_to_json" ] }, { "cell_type": "code", "execution_count": null, "id": "3e93da6a-e304-450c-a81e-ffecaf0d8a9a", "metadata": {}, "outputs": [], "source": [ "# Wrap proj_dir in Path so the `/` operator joins path components\n", "folder = Path(proj_dir) / 'data/raw/output'\n", "file_out = Path(proj_dir) / 'data/consolidated/simple_wiki.json'\n", "folder_to_json(folder, file_out)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }