{"_id":"5624ed3b6ff1010d009b1646","user":"55a7ae50bf1be93100d89df1","project":"55a7aee84a33f92b00b7a150","version":{"_id":"55a7aee84a33f92b00b7a153","__v":6,"project":"55a7aee84a33f92b00b7a150","createdAt":"2015-07-16T13:17:28.411Z","releaseDate":"2015-07-16T13:17:28.411Z","categories":["55a7aee94a33f92b00b7a154","55a7fefa3efe0c2f0074cbdf","55a8fb10c8bd450d000dd130","55a936b1cf45e1390093f362","55abddeaa36ccd0d00fdebe1","5624db675a86b423009462e1"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"__v":10,"category":{"_id":"5624db675a86b423009462e1","pages":["5624db8f85a31117001c5428","5624e72b06e8040d005ed6e4","5624ed3b6ff1010d009b1646","562502255473621900689514"],"__v":4,"project":"55a7aee84a33f92b00b7a150","version":"55a7aee84a33f92b00b7a153","sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-10-19T12:00:39.932Z","from_sync":false,"order":9999,"slug":"rapidminer-extension","title":"RapidMiner Extension"},"parentDoc":null,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2015-10-19T13:16:43.556Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":2,"body":"In this tutorial we’re going to walk you through using the “Text Analysis by AYLIEN” Extension for RapidMiner, to build a “News Analyzer” that monitors and analyzes articles from a particular RSS feed, or feeds.\n\nIf you’re new to RapidMiner, or it’s your first time using the Text Analysis Extension you should first read our [Getting Started](doc:getting-started-rapidminer) tutorial which takes you through the installation process. Also, If you haven’t got an AYLIEN account, which you’ll need to use the Extension, you can grab one [here](http://developer.aylien.com/signup?source=rapidminer).\n\nSo, here’s what we’re going to do:\n\n1. Monitor an RSS feed collecting Article updates using the **Read RSS Feed** Operator\n2. Extract the main body of text and Title from the article with the **Extract Article** Operator\n3. Analyze and categorize these articles using the **Categorize** Operator\n4. Extract Entities from the article, mentions of People, Places, Organization etc. using the **Extract Entities** Operator  \n5. Visualize our results and make them more consumable and understandable\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"Web Mining Extension\",\n  \"body\": \"Please note that this tutorial assumes you have the **Web Mining** Extension installed. You can download and install the Extension through the RapidMiner Marketplace.\"\n}\n[/block]\n**You can download the finished Process from our [Sample Processes](doc:sample-processes-rapidminer) page.** \n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Step 1. Collecting news articles\"\n}\n[/block]\nThe first step to building our News Analyzer will involve adding a **Read RSS Feed** Operator to our Process. 
When you add the Read RSS Feed you need to specify what RSS feed you want to monitor by adding the URL in the RSS feed input and adding your timeout counters, we’ve kept the default values:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/Lr2bvjLR4us7kP3V7Rl0_Screen%20Shot%202015-10-14%20at%207.29.55%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.29.55 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#5ab364\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Step 2. Extracting the article titles and body\"\n}\n[/block]\nTo extract the relevant pieces of text from the URLs collected we can use the **Extract Article** Operator. This will pull the main body of text, the title and any image present directly from the URL.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/DuP7RhI0TgGyq9kQeq1U_Screen%20Shot%202015-10-14%20at%207.29.57%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.29.57 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#cb5f33\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\nTo prepare our extracted text for analysis we use a **Data to Document** Operator. This will transform the dataset of text to a collection of documents making it easier to categorize.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/f1yJyBoLSgmguMQWy2h8_Screen%20Shot%202015-10-14%20at%207.30.00%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.30.00 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#59b365\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\nNow we need to specify which column(s) in the ExampleSet contain the text we want to create a Document from:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/k67DO2pQSXGnxdX7mRRK_Screen%20Shot%202015-10-14%20at%207.30.05%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.30.05 PM.png\",\n        \"1524\",\n        \"1044\",\n        \"#c5582e\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\nThe first thing we’re going to do with the extracted text is, try and get a high level understanding for what it’s about by categorizing it based on a particular taxonomy, in this case the IAB QAG taxonomy:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/hxIub98FTnSMwMOUQStQ_Screen%20Shot%202015-10-14%20at%207.30.09%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.30.09 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#cc5f33\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\nWe’ll then add a Document to Data Operator which transforms our documents back to an ExampleSet, making it easier to further process the data:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/nEc0ETH5Qse6h8tppSP2_Screen%20Shot%202015-10-14%20at%207.30.12%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.30.12 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#cc5f33\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Step 3. Extracting Entities\"\n}\n[/block]\nFinally, the last piece of analysis we’ll do on the text is extract any mention of an Entity (Keywords, People, Places, Organizations, % values, $ values etc.) 
using an **Extract Entities** Operator:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/XHhruJLjR2oPdx0IoT5M_Screen%20Shot%202015-10-14%20at%207.30.14%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.30.14 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#cb5f33\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Step 4. Results\"\n}\n[/block]\nNow if you connect the Operators and Run the process, your results will be displayed in an  ExampleSet tab like the one below. Each row will contain the extracted text and title, its appropriate categories as well as any Entities that were extracted separated out in columns:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/7E27ZT7S424WrDwLOQ0T_Screen%20Shot%202015-10-14%20at%207.34.41%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.34.41 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"#9a8a64\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Step 5. Visualizing the results\"\n}\n[/block]\nRapidMiner let’s you display and visualize results of your Process really easily using simple charts and visualizations like the ones below, which can all be created using the Charts widget on the left hand side of your results display.\n\nWe put together a simple pie-chart below visualizing the categories of the articles extracted with our News Analyzer:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/uELgC4PQQbe6K7acNeza_Screen%20Shot%202015-10-14%20at%207.29.16%20PM.png\",\n        \"Screen Shot 2015-10-14 at 7.29.16 PM.png\",\n        \"1552\",\n        \"1072\",\n        \"\",\n        \"\"\n      ],\n      \"caption\": \"Pie chart showing top level categories of Articles analyzed\"\n    }\n  ]\n}\n[/block]","excerpt":"","slug":"tutorial-news-analyzer-rapidminer","type":"basic","title":"Tutorial: Building an RSS/News Analyzer Process"}

Tutorial: Building an RSS/News Analyzer Process


In this tutorial we’re going to walk you through using the “Text Analysis by AYLIEN” Extension for RapidMiner to build a “News Analyzer” that monitors and analyzes articles from a particular RSS feed, or feeds.

If you’re new to RapidMiner, or it’s your first time using the Text Analysis Extension, you should first read our [Getting Started](doc:getting-started-rapidminer) tutorial, which takes you through the installation process. Also, if you don’t have an AYLIEN account, which you’ll need to use the Extension, you can grab one [here](http://developer.aylien.com/signup?source=rapidminer).

So, here’s what we’re going to do:

1. Monitor an RSS feed, collecting Article updates, using the **Read RSS Feed** Operator
2. Extract the main body of text and Title from each article with the **Extract Article** Operator
3. Analyze and categorize these articles using the **Categorize** Operator
4. Extract Entities from the articles (mentions of People, Places, Organizations, etc.) using the **Extract Entities** Operator
5. Visualize our results to make them more consumable and understandable

[block:callout]
{
  "type": "warning",
  "title": "Web Mining Extension",
  "body": "Please note that this tutorial assumes you have the **Web Mining** Extension installed. You can download and install the Extension through the RapidMiner Marketplace."
}
[/block]
**You can download the finished Process from our [Sample Processes](doc:sample-processes-rapidminer) page.**
[block:api-header]
{
  "type": "basic",
  "title": "Step 1. Collecting news articles"
}
[/block]
The first step in building our News Analyzer is adding a **Read RSS Feed** Operator to our Process. When you add the Read RSS Feed Operator you need to specify which RSS feed you want to monitor by entering its URL in the RSS feed input and setting the timeout values; we’ve kept the defaults:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/Lr2bvjLR4us7kP3V7Rl0_Screen%20Shot%202015-10-14%20at%207.29.55%20PM.png",
        "Screen Shot 2015-10-14 at 7.29.55 PM.png",
        "1552",
        "1072",
        "#5ab364",
        ""
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "type": "basic",
  "title": "Step 2. Extracting the article titles and body"
}
[/block]
To extract the relevant pieces of text from the URLs we’ve collected, we use the **Extract Article** Operator. This pulls the main body of text, the title and any images directly from each URL.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/DuP7RhI0TgGyq9kQeq1U_Screen%20Shot%202015-10-14%20at%207.29.57%20PM.png",
        "Screen Shot 2015-10-14 at 7.29.57 PM.png",
        "1552",
        "1072",
        "#cb5f33",
        ""
      ]
    }
  ]
}
[/block]
To prepare the extracted text for analysis we use a **Data to Document** Operator. This transforms the dataset of text into a collection of Documents, making it easier to categorize.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/f1yJyBoLSgmguMQWy2h8_Screen%20Shot%202015-10-14%20at%207.30.00%20PM.png",
        "Screen Shot 2015-10-14 at 7.30.00 PM.png",
        "1552",
        "1072",
        "#59b365",
        ""
      ]
    }
  ]
}
[/block]
Now we need to specify which column(s) in the ExampleSet contain the text we want to create a Document from:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/k67DO2pQSXGnxdX7mRRK_Screen%20Shot%202015-10-14%20at%207.30.05%20PM.png",
        "Screen Shot 2015-10-14 at 7.30.05 PM.png",
        "1524",
        "1044",
        "#c5582e",
        ""
      ]
    }
  ]
}
[/block]
The first thing we’ll do with the extracted text is try to get a high-level understanding of what each article is about by categorizing it against a particular taxonomy, in this case the IAB QAG taxonomy:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/hxIub98FTnSMwMOUQStQ_Screen%20Shot%202015-10-14%20at%207.30.09%20PM.png",
        "Screen Shot 2015-10-14 at 7.30.09 PM.png",
        "1552",
        "1072",
        "#cc5f33",
        ""
      ]
    }
  ]
}
[/block]
We’ll then add a **Document to Data** Operator, which transforms our Documents back into an ExampleSet, making it easier to process the data further:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/nEc0ETH5Qse6h8tppSP2_Screen%20Shot%202015-10-14%20at%207.30.12%20PM.png",
        "Screen Shot 2015-10-14 at 7.30.12 PM.png",
        "1552",
        "1072",
        "#cc5f33",
        ""
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "type": "basic",
  "title": "Step 3. Extracting Entities"
}
[/block]
Finally, we’ll extract any mentions of Entities (Keywords, People, Places, Organizations, % values, $ values, etc.) using an **Extract Entities** Operator:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/XHhruJLjR2oPdx0IoT5M_Screen%20Shot%202015-10-14%20at%207.30.14%20PM.png",
        "Screen Shot 2015-10-14 at 7.30.14 PM.png",
        "1552",
        "1072",
        "#cb5f33",
        ""
      ]
    }
  ]
}
[/block]
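For reference, here’s a rough sketch of the same collect-and-analyze pipeline in Python, calling the AYLIEN Text API directly, in case you’d like to see what the Operators above do behind the scenes. Treat it as illustrative only: the feed URL is just a placeholder, and the endpoint paths and header names are our assumptions based on the Text API v1 documentation, so double-check them against the current API reference.

```python
# Illustrative sketch: monitor an RSS feed and run each article through the
# AYLIEN Text API (article extraction, IAB-QAG categorization, entity extraction).
# Endpoint paths and header names are assumptions based on the Text API v1 docs.
import feedparser
import requests

API_BASE = "https://api.aylien.com/api/v1"  # assumed v1 base URL
HEADERS = {
    "X-AYLIEN-TextAPI-Application-ID": "YOUR_APP_ID",    # from your AYLIEN account
    "X-AYLIEN-TextAPI-Application-Key": "YOUR_APP_KEY",
}
RSS_URL = "http://feeds.bbci.co.uk/news/rss.xml"  # placeholder: any feed you want to monitor


def call(endpoint, params):
    """POST to one Text API endpoint and return the parsed JSON response."""
    response = requests.post(f"{API_BASE}/{endpoint}", headers=HEADERS, data=params)
    response.raise_for_status()
    return response.json()


results = []
for entry in feedparser.parse(RSS_URL).entries:
    # Step 2: pull the title and main body of text out of the article URL
    article = call("extract", {"url": entry.link})
    text = article.get("article", "")

    # Categorize the extracted text against the IAB QAG taxonomy
    categories = call("classify/iab-qag", {"text": text})

    # Step 3: extract entities (people, places, organizations, keywords, ...)
    entities = call("entities", {"text": text})

    results.append({
        "url": entry.link,
        "title": article.get("title", entry.title),
        "categories": [c["label"] for c in categories.get("categories", [])],
        "entities": entities.get("entities", {}),
    })

for row in results:
    print(row["title"], row["categories"])
```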
[block:image] { "images": [ { "image": [ "https://files.readme.io/f1yJyBoLSgmguMQWy2h8_Screen%20Shot%202015-10-14%20at%207.30.00%20PM.png", "Screen Shot 2015-10-14 at 7.30.00 PM.png", "1552", "1072", "#59b365", "" ] } ] } [/block] Now we need to specify which column(s) in the ExampleSet contain the text we want to create a Document from: [block:image] { "images": [ { "image": [ "https://files.readme.io/k67DO2pQSXGnxdX7mRRK_Screen%20Shot%202015-10-14%20at%207.30.05%20PM.png", "Screen Shot 2015-10-14 at 7.30.05 PM.png", "1524", "1044", "#c5582e", "" ] } ] } [/block] The first thing we’re going to do with the extracted text is, try and get a high level understanding for what it’s about by categorizing it based on a particular taxonomy, in this case the IAB QAG taxonomy: [block:image] { "images": [ { "image": [ "https://files.readme.io/hxIub98FTnSMwMOUQStQ_Screen%20Shot%202015-10-14%20at%207.30.09%20PM.png", "Screen Shot 2015-10-14 at 7.30.09 PM.png", "1552", "1072", "#cc5f33", "" ] } ] } [/block] We’ll then add a Document to Data Operator which transforms our documents back to an ExampleSet, making it easier to further process the data: [block:image] { "images": [ { "image": [ "https://files.readme.io/nEc0ETH5Qse6h8tppSP2_Screen%20Shot%202015-10-14%20at%207.30.12%20PM.png", "Screen Shot 2015-10-14 at 7.30.12 PM.png", "1552", "1072", "#cc5f33", "" ] } ] } [/block] [block:api-header] { "type": "basic", "title": "Step 3. Extracting Entities" } [/block] Finally, the last piece of analysis we’ll do on the text is extract any mention of an Entity (Keywords, People, Places, Organizations, % values, $ values etc.) using an **Extract Entities** Operator: [block:image] { "images": [ { "image": [ "https://files.readme.io/XHhruJLjR2oPdx0IoT5M_Screen%20Shot%202015-10-14%20at%207.30.14%20PM.png", "Screen Shot 2015-10-14 at 7.30.14 PM.png", "1552", "1072", "#cb5f33", "" ] } ] } [/block] [block:api-header] { "type": "basic", "title": "Step 4. Results" } [/block] Now if you connect the Operators and Run the process, your results will be displayed in an ExampleSet tab like the one below. Each row will contain the extracted text and title, its appropriate categories as well as any Entities that were extracted separated out in columns: [block:image] { "images": [ { "image": [ "https://files.readme.io/7E27ZT7S424WrDwLOQ0T_Screen%20Shot%202015-10-14%20at%207.34.41%20PM.png", "Screen Shot 2015-10-14 at 7.34.41 PM.png", "1552", "1072", "#9a8a64", "" ] } ] } [/block] [block:api-header] { "type": "basic", "title": "Step 5. Visualizing the results" } [/block] RapidMiner let’s you display and visualize results of your Process really easily using simple charts and visualizations like the ones below, which can all be created using the Charts widget on the left hand side of your results display. We put together a simple pie-chart below visualizing the categories of the articles extracted with our News Analyzer: [block:image] { "images": [ { "image": [ "https://files.readme.io/uELgC4PQQbe6K7acNeza_Screen%20Shot%202015-10-14%20at%207.29.16%20PM.png", "Screen Shot 2015-10-14 at 7.29.16 PM.png", "1552", "1072", "", "" ], "caption": "Pie chart showing top level categories of Articles analyzed" } ] } [/block]