{"__v":0,"_id":"56b1edd7bf040b0d00588bbd","initVersion":{"_id":"55a7aee84a33f92b00b7a153","version":"1.0"},"project":"55a7aee84a33f92b00b7a150","user":{"_id":"55a7c77b4a33f92b00b7a1a9","username":"","name":"Amir Mohammad Saeid"},"hidden":false,"createdAt":"2016-02-03T12:08:55.490Z","fullscreen":false,"htmlmode":false,"html":"","body":"## Classifying Homepages\n\nOur [article extractor](/docs/extract) is tailored for extracting the main text of a webpage. It works best when you need to [summarize](/docs/summarize) an article, or extract [concepts](/docs/concepts) and [named entities](/docs/entities) mentioned in it. However, there are occasions where you might need to classify the homepage of a website which might be lite on contents—full of headers, bullet points and images—and our extractor is unable to provide you with enough text in order to classify it. In such scenarios we suggest you construct the text by putting together contents of headers, meta tags, and image captions of the webpage. 
Here's an example demonstrating this approach using our [Node.js SDK](/docs/node-sdk):\n[block:embed]\n{\n  \"html\": \"<pre><code>var _ = require('underscore'),\\n    cheerio = require('cheerio'),\\n    request = require('request'),\\n    AYLIENTextAPI = require(\\\"aylien_textapi\\\");\\n\\nvar textapi = new AYLIENTextAPI({\\n    application_id: \\\"YourApplicationId\\\",\\n    application_key: \\\"YourApplicationKey\\\"\\n});\\n\\nvar url = 'http://www.bbc.com/';\\nrequest(url, function(err, resp, body) {\\n  if (!err) {\\n    var text = extract(body)\\n    textapi.classifyByTaxonomy({'text': text, 'taxonomy': 'iab-qag'}, function(err, result) {\\n      console.log(result.categories);\\n    });\\n  }\\n});\\n\\nfunction getText(tagName, $) {\\n  var texts = _.chain($(tagName)).map(function(e) {\\n    return $(e).text().trim();\\n  }).filter(function(t) {\\n    return t.length &gt; 0;\\n  }).value();\\n\\n  return texts.join(' ');\\n}\\n\\nfunction extract(body) {\\n  var $ = cheerio.load(body);\\n  var keywords = $('meta[name=\\\"keywords\\\"]').attr('content');\\n  var description = $('meta[name=\\\"description\\\"]').attr('content');\\n  var imgAlts = _($('img[alt]')).map(function(e) {\\n    return $(e).attr('alt').trim();\\n  }).join(' ');\\n  \\n  var h1 = getText('h1', $);\\n  var h2 = getText('h2', $);\\n  var links = getText('a', $);\\n  var text = [h1, h2, links, imgAlts].join(' ');\\n\\n  return text;\\n}</code></pre>\",\n  \"url\": \"https://gist.github.com/AYLIEN/03edf443e756ad3edab0\",\n  \"title\": \"Classify Homepages\",\n  \"favicon\": \"https://assets-cdn.github.com/favicon.ico\",\n  \"image\": \"https://avatars3.githubusercontent.com/u/421953?v=3&s=400\"\n}\n[/block]\nYou can try it out in our sandbox: https://developer.aylien.com/sandbox#03edf443e756ad3edab0","slug":"common-scenarios-solutions","title":"Common Scenarios & Solutions"}

# Common Scenarios & Solutions


## Classifying Homepages

Our [article extractor](/docs/extract) is tailored for extracting the main text of a webpage. It works best when you need to [summarize](/docs/summarize) an article, or extract [concepts](/docs/concepts) and [named entities](/docs/entities) mentioned in it. However, you may occasionally need to classify the homepage of a website that is light on content (made up largely of headers, bullet points, and images), where the extractor cannot gather enough text for classification. In such cases we suggest constructing the text yourself by combining the contents of the page's headers, meta tags, and image captions.

Here's an example demonstrating this approach using our [Node.js SDK](/docs/node-sdk):

```javascript
var _ = require('underscore'),
    cheerio = require('cheerio'),
    request = require('request'),
    AYLIENTextAPI = require('aylien_textapi');

var textapi = new AYLIENTextAPI({
  application_id: 'YourApplicationId',
  application_key: 'YourApplicationKey'
});

var url = 'http://www.bbc.com/';
request(url, function(err, resp, body) {
  if (!err) {
    var text = extract(body);
    textapi.classifyByTaxonomy({'text': text, 'taxonomy': 'iab-qag'}, function(err, result) {
      console.log(result.categories);
    });
  }
});

// Collects the trimmed, non-empty text of every element matching tagName.
function getText(tagName, $) {
  var texts = _.chain($(tagName)).map(function(e) {
    return $(e).text().trim();
  }).filter(function(t) {
    return t.length > 0;
  }).value();

  return texts.join(' ');
}

// Builds the classifier input from meta tags, image alt text,
// headings, and link texts.
function extract(body) {
  var $ = cheerio.load(body);
  var keywords = $('meta[name="keywords"]').attr('content') || '';
  var description = $('meta[name="description"]').attr('content') || '';
  var imgAlts = _($('img[alt]')).map(function(e) {
    return $(e).attr('alt').trim();
  }).join(' ');

  var h1 = getText('h1', $);
  var h2 = getText('h2', $);
  var links = getText('a', $);
  var text = [keywords, description, h1, h2, links, imgAlts].join(' ');

  return text;
}
```

The full example is also available as a [gist](https://gist.github.com/AYLIEN/03edf443e756ad3edab0), and you can try it out in our sandbox: https://developer.aylien.com/sandbox#03edf443e756ad3edab0