Text Analysis Platform – User Guide

Welcome to the AYLIEN Text Analysis Platform, or TAP for short. Using TAP, you can easily build custom Natural Language Processing models without leaving your browser, and then make calls to your models just like any other API.

With TAP, you can leverage the underlying technology of AYLIEN's Text Analysis API and build tools for your own specific use case, whatever that may be. So if you have a task that you could simplify with text analysis but you can't find the right API for it, you can now easily build your own API in TAP.

Whether you want to classify your organization's incoming messages according to what team can best respond to them, or you want to build a sentiment analysis model specifically for your market niche, TAP is an advanced but easy-to-use platform from which you can build your own custom NLP solutions.

Getting Started with TAP

To get started building your first custom model, sign up to grab your login credentials for TAP. We're currently in public Beta testing, so we're offering a free plan that allows you to deploy a custom model and make 30,000 calls to it per month without charge.

Once you log in, you can take the interactive tour of TAP from the homepage, or you can dive right in and start building your first custom model.

Building an accurate Model consists of three phases: first you need to upload a Dataset, then train your Model on that Dataset, and finally you test how accurate your Model is. All of this is an iterative process, so after you've tested your Model you will most likely see places to improve accuracy by tweaking your Dataset and retraining your Model.

Datasets

In order to build a custom model, you need to train it on a Dataset of relevant, labeled data. This labeled data consists of documents (such as Tweets, emails, reviews – anything your Model will analyze in the real world) with labels attached to each document (whatever labels you want your Model to assign to the real-world data). After your model is trained on your Dataset, it will be able to analyze new data and predict what label to assign to it, based on what it has learned from the Dataset.

TAP simplifies the Dataset-building process by letting you gather labeled data in the TAP app before converting it to a Dataset according to your specifications. This means you can build your Dataset from one source, or a combination of multiple sources.

You can build your Dataset from three sources:

  • Upload labeled samples of textual data in a CSV file,
  • Upload sample text data as text files and label this data in the TAP app,
  • If you have no sample text data, you can leverage TAP's Knowledge Base, a repository of millions of labeled text documents.

You can also fork another TAP user's Dataset (provided they have made it public), or browse the Marketplace to see if there is a Dataset that suits your use case.

Creating a Dataset

To get started building your Dataset, click on the Create Dataset button, which you can find on both the My Datasets page and your TAP homepage.

[Screenshot: Create Dataset]
This will take you to the Setup stage of creating a Dataset, where you select whether to use a CSV file you have prepared or to create a new Dataset within TAP.
[Screenshot: Create Dataset button]
If you have all of the labeled data you need to create your entire Dataset ready in one CSV file, you can create your Dataset with the Upload CSV button. But you can also create a Dataset from text files, from TAP's Knowledge Base, or a mixture of any of the three. If you want to create your Dataset by these means, select the Create a new Dataset option. This will essentially create an empty Dataset that you can populate with data from other sources.

Whichever option you choose at this point, TAP allows you to add more data from any source to your Dataset at any point.

From a CSV File

TAP allows you to create a Dataset from a single CSV file of labeled sample text documents. When uploading your CSV file, you'll need to ensure that your data is labeled correctly and in the correct format, so TAP can easily and efficiently train your Model. Your CSV file should consist of a column of text documents you have gathered and a column of the labels you want your Model to associate with each document.
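
As an illustration, here is a minimal sketch of what such a file might look like, written with Python's csv module; the column names and sample rows are hypothetical, so substitute whatever documents and labels fit your use case:

import csv

# Hypothetical labeled samples: one column of documents, one column of labels.
rows = [
    ("I love this phone, the battery lasts for days", "positive"),
    ("The screen cracked after a week, very disappointed", "negative"),
    ("Great value for the price, would buy again", "positive"),
]

with open("my_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["document", "label"])  # optional header row
    writer.writerows(rows)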

After clicking on the Upload CSV button and selecting the CSV file from your computer, you will notice that your Running Jobs icon is whirring, and a popup will appear once your file has finished uploading. When it does, click on Convert, which will take you to the Dataset Preview page, where you will be able to see some sample rows of your data.

[Screenshot: Dataset Preview]

You'll need to complete three actions on the Dataset Preview page:

  1. You need to specify the role of each column – whether it is a Document, a Label or something else (Attribute). Note that if you included a header row in your CSV file, you still need to declare which column is the document and which is the label – TAP sees the header simply as another row of data.

    [Screenshot: Document button]

  2. You need to declare whether the first row of your data is a Header Row or not (that is, whether it contains sample data or the titles of the columns).

[Screenshot: Header slider]

  3. In the Column Data Type options, select the type of data that is in each column. Make sure that "String" is selected for your Document and Label columns (this type will be selected by default).

[Screenshot: String button]

From Text Files

If your input data is in the form of individual text files, you can also use these files to create a new Dataset, or add to an existing Dataset. Since text files don't let you label documents like CSV files do, you'll need to create your taxonomy first and upload your text files into that taxonomy.

So to create a Dataset from text files in TAP, we simply create a folder for each label in our taxonomy, and then fill each folder with the text documents that correspond to that label. For example, if we are building a Dataset with two labels, "positive" and "negative," we will create a folder named "positive" and a folder named "negative". We will then fill the "positive" folder with text files containing positive documents, and the "negative" folder with text files containing negative documents.
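
As a rough sketch of this layout, the following Python snippet creates one folder per label and one text file per document on your own machine before you upload them; the folder and file names are hypothetical:

from pathlib import Path

# Hypothetical documents for a two-label taxonomy.
samples = {
    "positive": ["Fantastic service, will definitely come back."],
    "negative": ["The delivery was late and the package was damaged."],
}

# One folder per label and one text file per document, mirroring the
# taxonomy you will build in TAP before uploading the files.
for label, documents in samples.items():
    folder = Path("my_dataset") / label
    folder.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(documents):
        (folder / f"doc_{i}.txt").write_text(text, encoding="utf-8")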

To start creating your Dataset from text files, select the Create a New Dataset option on the Set Up page. After you name your Dataset, you'll be taken to the Build page, where your taxonomy is shown as a single root folder.

[Screenshot: Dataset overview]

To add labels to your taxonomy, select the folder and click on Add a Label. This will let you create labels/subfolders within this label/folder, which you should name with the labels you wish to classify your documents with.

Once you have built your taxonomy, you can upload the text files from your computer into the appropriate subfolders by clicking anywhere in the File Upload section of the page. After you have uploaded your text files, you will see them in the subfolder as CSV files – TAP converts them to CSV files because it adds the title of the subfolder as a label to each of the documents within it.

[Screenshot: taxonomy]

You can add, delete, or rename files and folders after they've been added, all by selecting them and clicking on the buttons above the taxonomy, or by right-clicking on each file or folder.

You can also preview and inspect details about each file you have uploaded to make sure each node contains the correct data. Select a file and its details will appear on the right; click on the Preview file button to see samples of the documents and labels.

[Screenshot: node view]

From Knowledge Base

If you don't have enough data to build a good Dataset, or even if you don't have any data at all, you can use Knowledge Base to source labeled data. Knowledge Base is TAP's repository of millions of labeled text documents that you can leverage to build or add to your Datasets.

You can access Knowledge Base after you create a Dataset by selecting a subfolder in your taxonomy and clicking on Launch Knowledge Base, which opens it in a popup.

[Screenshot: Knowledge Base popup]

To search Knowledge Base, you first need to select which document repository you want to search, as the metadata and the documents in each source are different. You can build a query from multiple parameters using Boolean expressions, so a search for Amazon reviews that contain the words 'Camera' and 'Lens' but not the word 'Canon' will read: "Camera AND Lens NOT Canon".

You can build a search with other parameters, depending on the repository, by clicking on these parameters that are located beneath the search bar.

Once you have narrowed down the results you need, you can either cherry-pick the documents that you find suitable for your Dataset, or import up to 10,000 documents in bulk for each node of your Dataset, which you can then preview to ensure the data is in order.

Forking a Dataset

If you don't want to create your own Dataset from scratch, you can fork one from another user, provided that user has made the Dataset public. Forking a Dataset allows you to make a copy of a Dataset to your TAP account, from which you can add or remove files and nodes, or add, remove, or change data within these nodes.

To fork a Dataset, find the Dataset you wish to begin work on in the Marketplace, click on the Dataset's name, and click the Fork button in the top right of the screen.

Preparing a Dataset for training and testing a Model

In order to build a Model, we need to train it on labeled data, but after this training period we also need to test the Model on some other similar data. A common practice in Machine Learning for evaluating models is to set aside a portion of the overall labeled dataset (often called a 'test set') that the model will never see during training. Once the model is trained on the training portion of the labeled dataset, it is asked to make predictions on the data found in the test set. These predictions are then compared to the known labels from the test set, and used to calculate the performance of the trained model.

TAP lets you choose between two methods of specifying your training and test data: Collections and Splitting. Using Collections, you can manually specify which portions of your Dataset should be used for training and testing, while using TAP's Splitter, you can let TAP choose a subset of your data at random as your held out test set, and use the rest for training.

Collections

There are two ways to assign a Collection to the documents in a Dataset.

If you're uploading a CSV file, you can add a column that consists of either "test" or "training" labels, marking each example you want to hold back for testing with "test". Once you've uploaded your CSV file, select the Collections attribute for this column on the Dataset Preview page.

[Screenshot: Collections attribute]
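
For illustration, here is a minimal sketch of such a CSV file written with Python's csv module; the column name "collection" and the sample rows are hypothetical, and the column's role is assigned on the Dataset Preview page rather than inferred from its name:

import csv

# Hypothetical rows: a document, its label, and the Collection it belongs to.
rows = [
    ("I love this phone, the battery lasts for days", "positive", "training"),
    ("The screen cracked after a week, very disappointed", "negative", "training"),
    ("Great camera, terrible battery life", "negative", "test"),
]

with open("dataset_with_collections.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["document", "label", "collection"])
    writer.writerows(rows)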

If you're using text files or importing data from Knowledge Base, you'll need to have a file (or multiple files) containing your test examples and no training examples. After uploading this file to your Dataset and clicking on it in your taxonomy, you'll be able to select Test in the Collection options.

[Screenshot: selecting a Collection]

Splitting your Dataset

Splitting your Dataset randomizes the selection of training and test data, according to the ratio you specify. So if you select a 90% split rate, TAP will randomly select 10% of the documents in your Dataset across all labels and hold them back for testing.

TAP makes choosing a splitting ratio easy with an intuitive slider function, which shows you how many documents will be used for training and testing at each ratio. Once you're happy with the ratio, click Split Dataset and TAP will take you to the next stage – training your Model.
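
Conceptually, the Splitter does something similar to the following Python sketch, which shuffles a toy dataset and holds back 10% for testing; the data and the exact selection procedure are illustrative only:

import random

# A toy dataset of (document, label) pairs.
dataset = [(f"document {i}", "positive" if i % 2 else "negative")
           for i in range(1000)]

random.seed(42)          # fixed seed so the example is reproducible
random.shuffle(dataset)  # TAP selects the held-out documents at random

split_ratio = 0.9        # 90% training data, 10% test data
cutoff = int(len(dataset) * split_ratio)
training_set, test_set = dataset[:cutoff], dataset[cutoff:]

print(len(training_set), "training documents,", len(test_set), "test documents")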

What ratio should I split my dataset into?

Picking the appropriate training:test split ratio for your dataset depends on a couple of different factors. If you want the most accurate model possible, a higher training:test ratio is recommended (for example, 90% training data to 10% test data). But if you want to see how well your model can generalize to real-world data, you should specify a lower training:test ratio (around 50:50).

However, an important factor to keep in mind when choosing your training:test ratio is the size of your dataset. If you have a small dataset (a couple of thousand documents or fewer), using a low training:test ratio might leave your model without enough data to properly train on, which will produce an inaccurate model.

A useful test to employ here is to compare the performance of models trained at each ratio: first train a model with a high training:test ratio, then train one with a lower ratio. If these models achieve comparable performance, you can be confident that you have an accurate model that will generalize well to real-world data.

Models

In Natural Language Processing, we leverage Machine Learning techniques to build statistical models that mimic our ability to understand and use language. These models are often 'trained' on large volumes of textual data that are tagged with labels according to the task at hand. The training process involves building an abstract representation of the training examples, and mapping them to the assigned labels, and later using these learned mappings to predict labels of new, unseen documents.

TAP allows you to leverage this Natural Language Processing technology to build custom language analysis models and use them in whatever it is you're building, all from within your browser.

After you train your own NLP model you can use it within your applications by exposing it as a web-based API.

Training a Model

After you have built your Dataset and prepared it for training, the first step in training your Model is to select the parameters for the training process. Parameters determine how your Model will be trained, and can determine the performance and the overall quality of your Model.

These parameters can be quite advanced, so TAP users can select from simplified parameter settings that are optimized for shorter, social media-style content, or longer, news article-style content. However, advanced users can select the setting of each parameter manually.

TAP allows you to tweak some of the common training parameters:

N-gram size

When analyzing the sample documents, your Model breaks all of the text down into short strings of words and analyzes the text data in this form – even though these strings might be unintelligible to a human reader. This allows your Model to become more familiar with patterns in the data.

Setting the number of words (N) that will be in each string (gram) allows you to train your Model on a larger number of shorter strings, or a smaller number of longer strings.

Longer n-grams allow the Model to capture longer sequences of words, such as multi-word expressions. This can be helpful when building models for distinguishing topics from one another. For instance, with an n-gram size of one, "Real Madrid" will be seen by the Model as two fairly common and non-distinguishing words, whereas with an n-gram size of two, the Model will see the entire phrase, which is highly indicative of the sports category.
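
As a rough illustration of word n-grams (TAP's internal tokenization may differ), the following Python sketch extracts n-grams of size one and two from a short sentence:

def ngrams(text, n):
    """Return the word n-grams of size n for a piece of text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Real Madrid won the match"
print(ngrams(sentence, 1))  # ['real', 'madrid', 'won', 'the', 'match']
print(ngrams(sentence, 2))  # ['real madrid', 'madrid won', 'won the', 'the match']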

Removing stopwords

"Stopwords" refer to common words that don't convey much meaning, like "the," "a," and "and". Removing stopwords means that your Model won't be seeing these words,  and will be trained on a cleaner Dataset. TAP recognizes stopwords based on a manually curated, comprehensive list, but since what defines a stopword can vary depending on context, TAP users can have this list altered for a specific industry or niche use case.

Since shorter documents like Tweets contain so little text, it's better not to exclude stopwords while training your Model. This is because stopwords often carry useful information in shorter documents, and also because there would be very little text data left over once they were removed.

Converting all text to lowercase

By lowercasing all letters in your documents, you can normalize the different ways in which words in your corpus might be capitalized (e.g. by converting "Books" and "BOOKS" to "books"), and thus make it easier for your Model to learn about those words regardless of how they are written.

Setting the maximum vocabulary size

Using the maximum vocabulary size parameter, you can limit the total number of unique words that your Model will take into account when analyzing your documents. This limit is applied to a frequency-sorted list of all words found in your corpus, so if you set it to 5,000, only the 5,000 most frequently used words in your corpus will be used for training your Model.

Setting the number of training epochs

The number of training epochs indicates how many times the Model will iterate over the entire training data. Higher numbers of training epochs often lead to better Model performance.
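
To make these preprocessing parameters more concrete, here is a minimal Python sketch of roughly what stopword removal, lowercasing, and a maximum vocabulary size do to your text before training; the stopword list here is a tiny hypothetical one, and the exact behaviour inside TAP will differ:

from collections import Counter

# A tiny, hypothetical stopword list; TAP maintains its own curated list.
STOPWORDS = {"the", "a", "and", "is", "of"}

documents = [
    "The camera is great and the battery is fine",
    "The battery of the camera died after a day",
]

# Lowercase the text and drop stopwords, as the corresponding parameters would.
tokens = [
    word
    for doc in documents
    for word in doc.lower().split()
    if word not in STOPWORDS
]

# Keep only the most frequent words, mimicking a maximum vocabulary size of 5.
max_vocab_size = 5
vocabulary = [word for word, _ in Counter(tokens).most_common(max_vocab_size)]
print(vocabulary)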

Setting the learning rate

When your Model is being trained, it analyzes the sample documents and learns patterns in the language data. The learning rate refers to how quickly your Model will learn these patterns – a small learning rate will mean your Model makes slow but accurate progress, while a large learning rate means your Model will make quick progress but could end up with some serious misunderstandings about the data.

If you have noisy data, it's best to set a small learning rate, since your Model might require some time to recognize the correct patterns in your data.

Evaluating a Model

Training a Model is an iterative process, where you test your newly-trained Model, use the results of these tests to inform updates you make to the Dataset, and then retrain your Model on this Dataset. This process continues until you are satisfied with the accuracy of your Model.

In your Dataset, you have separated the labeled data into training and test data, either by manually selecting the data for testing or by splitting your Dataset at random. After you click Train Your Model, TAP trains your Model on your Dataset and then tests the Model by showing it the data that has been allocated as test data, collecting its predictions, and comparing them to the actual labels from the test set.

When your Model is finished training and testing, you will be shown the Evaluation Results page, where you will be presented with the results of your Model's predictions of the test data samples.

The Evaluation Results page comprises four sections, each with statistics on your Model's accuracy:

  • Model Performance
  • Confusion Matrix
  • Sample Predictions
  • Live Evaluation

Model Performance

The Model Performance tables show you how your Model performed across four major metrics:

Accuracy

Accuracy is a simple measure of the number of correct predictions divided by the total number of predictions. This is useful if your documents are evenly spread across labels and you need a quick measure of how accurate your Model is, but if you need more detail on your Model's performance, you will need to look into the Precision, Recall, and F1 scores.

Precision

The precision score shows how accurate the predictions made by your model were. More specifically, it measures how many of the predictions that it made were correct.

Imagine, for example, we are testing our model on 1,000 Tweets we know are positive and 1,000 Tweets we know are negative. Most of the time, our model will classify positive Tweets as "Positive" and negative Tweets as "Negative". But it will also make mistakes, sometimes classifying positive Tweets as "Negative" and vice versa. The precision score measures, out of all the predictions of "Positive," how many were actually positive. In other words, it measures how precise the predictions it made were.

The precision is measured on a scale of 0 to 1, with 0 being completely inaccurate and 1 being totally accurate.

Recall

The recall score shows how comprehensive your Model's accurate predictions were. Again, imagine we have 1,000 Tweets we know are positive and 1,000 Tweets we know are negative. And remember that our Model will make some correct predictions and some incorrect predictions. The recall score measures the percentage of each of the labels that our Model correctly identified as that label, and the percentage that it missed. In other words, of all the positive Tweets we had in our Dataset, how many did the Model recall as "Positive?"

On a scale of 0 to 1, a score of 0.2 would indicate that the Model only classified 20% of the positive Tweets as "Positive," whereas a score of 0.9 would indicate that of all the positive Tweets, our classifier classified 90% of them as "Positive".

F1

The F1 score combines the Precision and Recall scores (it is their harmonic mean). The reason we include this is that it's important to measure both at once – it's no good if your model is very precise but has very little recall. In other words, if your model makes only one prediction and that prediction is correct, it will have a perfect precision score of 1, but there could be hundreds of documents with the same label that it didn't correctly predict, so the recall score will be terrible.

For example, in a sentiment analysis model, a precision score asks: "of all of the positive predictions, how many were actually positive?" A recall score asks: "of all of the documents that were positive, how many did my model identify as positive?"  The F1 score combines these two questions into one score.

In addition, the F1 score takes the distribution of your documents across labels into account, assigning more weight to labels that have fewer training data samples. So if your Model is having problems predicting one of these labels, your F1 score will show this more clearly than the Accuracy score.
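
For intuition only, here is a minimal Python sketch of how these scores can be computed per label on a toy set of predictions; TAP calculates and reports them for you:

# Toy evaluation: real labels from the test set and the Model's predictions.
real = ["positive", "positive", "negative", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative", "positive", "positive"]

def scores(label):
    tp = sum(r == label and p == label for r, p in zip(real, predicted))
    fp = sum(r != label and p == label for r, p in zip(real, predicted))
    fn = sum(r == label and p != label for r, p in zip(real, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

accuracy = sum(r == p for r, p in zip(real, predicted)) / len(real)
print("accuracy:", accuracy)
for label in ("positive", "negative"):
    print(label, "precision/recall/F1:", scores(label))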

Confusion Matrix

A confusion matrix is a simple way to inspect how your Model predicted each label on the data attached to each real label. This allows you to quickly spot labels that your Model is having trouble identifying or distinguishing from other labels.

"Real Labels" refer to the labels that you attached to your data, while "Predicted Labels" refer to the labels that your Model predicted were attached to your data. In a confusion matrix, the real and predicted labels form axes that create cells corresponding to each axis. If you add every cell's value in each row together, it will give you the total number of documents in your test data set.

The resulting cells allow you to see how accurate your Model was at predicting each label. For example, below is the confusion matrix for a test data set of 99 documents that were labelled either "positive" or "negative". The cell on the top left shows how many samples in the test data that were marked "negative" our Model also predicted as "negative". Below that you will see the number of samples from our test data that our Model predicted as "positive," and so on.

[Screenshot: confusion matrix]
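
For intuition, the following Python sketch builds a small confusion matrix from a toy set of real and predicted labels; TAP renders this for you on the Evaluation Results page:

from collections import Counter

# Toy real and predicted labels for a small test set.
real = ["positive", "positive", "negative", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative", "positive", "positive"]

# Count each (real label, predicted label) combination.
matrix = Counter(zip(real, predicted))

labels = ["negative", "positive"]
print("real \\ predicted:", labels)
for r in labels:
    print(r, [matrix[(r, p)] for p in labels])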

Sample Predictions

In the Sample Predictions section of the Evaluation Results page, you can see the predictions your Model made on samples of the test data from your Dataset.

Sample Predictions is essentially a more granular view of the same evaluation seen in the Confusion Matrix. You can see the real and predicted labels and where your Model made inaccurate predictions, but you can also see the document that these labels relate to, allowing you to see for yourself what your Model was basing its predictions on.

[Screenshot: highlighted text]

Live Evaluation

Using the Live Evaluation widget, you can enter any text and receive your Model's prediction of which label should be assigned to it. This allows you to try out the kind of inputs your Model will receive in the real world, and see how well it can predict labels for them.

A highly useful tool here is the Explain the Results feature. Using this tool, you can gain a better understanding of what features of your input text were given a higher weight by your Model for making this prediction.

This allows you to ensure that your Model is picking up the right features, and for example is not overfitting on meaningless but highly indicative features that might be in your training data.

For example, the Text with Highlighted Words section returns your sample input to you along with words that have been highlighted according to how much they contributed to your Model predicting the label it predicted for your input.

[Screenshot: highlighted words]

The explanation section also includes a graphical display of the words from your input text that your Model attached the heaviest weights to – that is, the words (or the combinations of words) that your Model thought were the most important when making its prediction.

[Screenshot: word weights graph]

Improving Accuracy

The accuracy of your Model depends on one general principle: your Model needs to be trained on a large amount of high-quality, relevant data spread across all labels in your Dataset.

If you don't have enough labeled data, your Model won't have enough information to recognize patterns in that data, and so it won't be able to recognize the same patterns in the test data. When you test your Model, you will see that it has poor recognition of the affected labels.

By high-quality data, we mean data that contains enough distinguishing features that your Model can use to recognize the difference between labels. You should be able to easily tell the difference between the data under different labels. If you can't do this, your Model will also have a tough time telling the difference between them.

Deploying a Model

Once you are happy with your Model's accuracy, you are ready to Deploy your Model into production and make calls to it like any other API. While you train and test your Model, it resides in TAP's Sandbox, where no other software can access it so you can build and test your Model until you are happy with its accuracy. Deploying your Model means that it can be accessed from outside the TAP app, and you can't tweak or retrain your Model until you move it back into the Sandbox.

You can store multiple Models on TAP without deploying them. You will be able to browse all of your stored Models on the Your Models page. The number of Models you can have in production at the same time depends on your pricing plan.

Using the Marketplace

Using Public Datasets & Models

In TAP, it is possible to use Datasets and Models built by the AYLIEN team and other TAP users, provided these resources have been made public. You can see the list of available resources by clicking on the Explore tab, and fork a Dataset or Model to your own account.

[Screenshot: Explore tab]

Publishing Your Datasets & Models to Marketplace

In TAP, it is possible to share your Datasets and Models with other users by making them public.

Tags

In order to make your Model discoverable to other users, add relevant tags to it by clicking on the Manage Tags button on the Model's page.

TAP API

Once you've deployed your Model, you can call it like any other API. Every Model you build will have its own endpoint and as a user you will have your own API key.

Your API key

Once you sign up for TAP, you will receive an API key, which you can see in the API Keys section of the dropdown menu under your username.

[Screenshot: API Keys]

Calling your Model from an application

To call your Model from an application, you'll need to include the endpoint of the Model and your API key in your code. These details can be viewed on each Model's page:

[Screenshot: Model summary]

HTTP Request

Your Model acts as a RESTful API and will support GET and POST requests.

  • GET https://api.tap.aylien.com/v1/models/:model_id
  • POST https://api.tap.aylien.com/v1/models/:model_id

Be sure to replace :model_id with the ID of the Model you intend to make the request to.

cURL

curl -X POST \
     -H "x-aylien-tap-application-key: YOUR_API_KEY" \
     --data-urlencode "text=Your text here" \
     https://api.tap.aylien.com/v1/models/:model_id

Python

import requests

model_id = "YOUR_MODEL_ID"
url = "https://api.tap.aylien.com/v1/models/" + model_id

payload = "text=Your text here."
headers = {
    'x-aylien-tap-application-key': "YOUR_API_KEY",
    'content-type': "application/x-www-form-urlencoded"
    }

response = requests.request("POST", url, data=payload, headers=headers)

print(response.text)

Ruby

require 'uri'
require 'net/http'

url = URI("https://api.tap.aylien.com/v1/models/YOUR_MODEL_ID")

http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

request = Net::HTTP::Post.new(url)
request["x-aylien-tap-application-key"] = 'YOUR_API_KEY'
request["content-type"] = 'application/x-www-form-urlencoded'
request.body = "text=Your text here."

response = http.request(request)
puts response.read_body

C#

var client = new RestClient("https://api.tap.aylien.com/v1/models/YOUR_MODEL_ID");
var request = new RestRequest(Method.POST);
request.AddHeader("content-type", "application/x-www-form-urlencoded");
request.AddHeader("x-aylien-tap-application-key", "YOUR_API_KEY");
request.AddParameter("application/x-www-form-urlencoded", "text=Your text here.", ParameterType.RequestBody);
IRestResponse response = client.Execute(request);

Node.js

var request = require("request");

var options = { method: 'POST',
  url: 'https://api.tap.aylien.com/v1/models/YOUR_MODEL_ID',
  headers: 
   { 'content-type': 'application/x-www-form-urlencoded',
     'x-aylien-tap-application-key': 'YOUR_API_KEY' },
  form: { text: 'Your text here.' } };

request(options, function (error, response, body) {
  if (error) throw new Error(error);

  console.log(body);
});

PHP

<?php

$request = new HttpRequest();
$request->setUrl('https://api.tap.aylien.com/v1/models/YOUR_MODEL_ID');
$request->setMethod(HTTP_METH_POST);

$request->setHeaders(array(
  'content-type' => 'application/x-www-form-urlencoded',
  'x-aylien-tap-application-key' => 'YOUR_API_KEY'
));

$request->setContentType('application/x-www-form-urlencoded');
$request->setPostFields(array(
  'text' => 'Your text here.'
));

try {
  $response = $request->send();

  echo $response->getBody();
} catch (HttpException $ex) {
  echo $ex;
}

?>

Java

HttpResponse<String> response = Unirest.post("https://api.tap.aylien.com/v1/models/YOUR_MODEL_ID")
  .header("x-aylien-tap-application-key", "YOUR_API_KEY")
  .header("content-type", "application/x-www-form-urlencoded")
  .body("text=Your text here.")
  .asString();

Go

package main

import (
    "fmt"
    "strings"
    "net/http"
    "io/ioutil"
)

func main() {

    url := "https://api.tap.aylien.com/v1/models/YOUR_MODEL_ID"

    payload := strings.NewReader("text=Your text here.")

    req, _ := http.NewRequest("POST", url, payload)

    req.Header.Add("x-aylien-tap-application-key", "YOUR_API_KEY")
    req.Header.Add("content-type", "application/x-www-form-urlencoded")

    res, _ := http.DefaultClient.Do(req)

    defer res.Body.Close()
    body, _ := ioutil.ReadAll(res.Body)

    fmt.Println(res)
    fmt.Println(string(body))

}

Authentication

Supply TAP with your API key in the header of the request in the following format:

'x-aylien-tap-application-key': 'YOUR_API_KEY'

Parameters

Each Model in TAP can have its own specific parameters, and you should refer to the Model page to retrieve these. Below we show the default parameters that many TAP Models expose:

Parameter name | Data type | Description
text           | String    | Text to analyze
url            | String    | URL to analyze
top_k          | Integer   | Number of predictions to return (defaults to 1; can be as large as the number of labels in the Dataset)

Please note that the request parameters may be different for certain models, therefore it's best to refer to the Usage Details section of the Model's page to understand its request format.
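
As an example of using these parameters, here is a variation of the Python snippet above that analyzes a web page and asks for the top three predictions; the URL is a placeholder, and, as noted, your Model may expect different parameters:

import requests

# Placeholders: replace with your own Model ID and API key.
model_id = "YOUR_MODEL_ID"
endpoint = "https://api.tap.aylien.com/v1/models/" + model_id

headers = {"x-aylien-tap-application-key": "YOUR_API_KEY"}

# Analyze the text of a web page and ask for the top 3 predictions.
payload = {"url": "https://example.com/some-article", "top_k": 3}

response = requests.post(endpoint, data=payload, headers=headers)
print(response.json())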

Returned JSON

When queried, your custom classification Model will return a JSON object containing the predicted label and the confidence score of this prediction. Please note that the response format may be different for certain models, therefore it's best to refer to the Usage Details section of the Model's page to understand its response format.

{
    "text": "John is a great football player",
    "version": "1.0",
    "explanations": "",
    "categories": [{
        "baseline": 0.0,
        "confidence": 0.9,
        "cob": 0.0, 
        "id": "w52c38jm-fd29-8t1h-ad31-56bw2300fl9c",
        "label": "Sports§"
    }]
}
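
To work with this response in your application, you might parse it along these lines (a minimal Python sketch; the exact fields depend on your Model, as noted above):

import requests

model_id = "YOUR_MODEL_ID"  # placeholder
response = requests.post(
    "https://api.tap.aylien.com/v1/models/" + model_id,
    data={"text": "John is a great football player"},
    headers={"x-aylien-tap-application-key": "YOUR_API_KEY"},
)

result = response.json()
for category in result["categories"]:
    # Each prediction carries a label and the Model's confidence in it.
    print(category["label"], category["confidence"])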

Billing & Analytics

Plans & Billing

Currently TAP is in Public Beta, so we are offering the Free Plan, where you can build and deploy one Model free of charge. Upgrading to our Starter Plan will allow you to deploy 4 Models to production, as well as doubling the number of calls you can make to your API.

You will be notified of any changes to the subscription plan during the Beta period at least one month in advance.

[Screenshot: TAP plans]

Billing Settings

Managing your billing settings in TAP is easy. By clicking on Payment Methods in the drop-down menu beneath your username, and then Add Payment Method, you can set up and manage your credit card payments.

You can access every invoice you receive for your TAP account under My Invoices.

Monitoring your Usage

The Analytics page provides a detailed breakdown of the calls made to each of your Models over time. You may notice that your Models have been called while not deployed – remember that calls you make during live evaluation will count towards your usage.

[Screenshot: API usage]

Common Workflows

Calling TAP Models from Google Spreadsheets

To call your TAP Models from Google Sheets, navigate to Tools > Script editor and copy and paste the following snippet to replace the contents of Code.gs. Don't forget to replace YOUR_API_KEY at the top of the script with your API key from TAP:

function getAppKey() {
  return "YOUR_API_KEY";
}

function getApiUrl(model_id) {
  return "https://api.tap.aylien.com/v1/models/" + model_id;
}

function validateURL(value) {
  return /^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?$/i.test(value);
}

function detect_type(text) {
  if (validateURL(text)){
    return "url";
  } else {
    return "text";
  }
}

function callTap(params) {
  var payload = {};
  var maxTries = 3;

  if (!params["value"] || params["value"].trim() == '')
    return [];

  params["value"] = params["value"].trim();
  var type = detect_type(params["value"]);

  if (type == "url")
    payload["url"] = params["value"]
  else
    payload["text"] = params["value"]

  payload["top_k"] = parseInt(params["top_k"] || 1).toString();

  var result;
  var count = 0;
  var catch_count = 0;

  while(true) {
    var api_url = getApiUrl(params["model_id"]);
    var options = {
      "method" : "post",
      "headers": {
        "x-aylien-tap-application-key": getAppKey()
      },
      "payload": payload,
      "muteHttpExceptions": true
    };

    result = UrlFetchApp.fetch(api_url, options);

    var status_code = result.getResponseCode();
    if (status_code !== 200) {
      Utilities.sleep(5000 * (count + 1));
      if (++count == maxTries) throw "An error occurred in TAP";
    } else {
      break;
    }
  }


  var response = JSON.parse(result.getContentText());

  return response.categories;
}

/**
 * Returns the analysis results from a TAP Model for a text or URL.
 *
 * @param {text} value Text or URL (link) to be analyzed
 * @param {text} model_id ID of the TAP Model you would like to run
 * @param {number} top_k Number of predictions to get back
 * @return The analysis results from TAP as `label` and `confidence` (2 columns)
 * @customfunction
 */
function TAPResults(value, model_id, top_k) {
  if (value.map) {
    // Handle ranges of cells, passing model_id and top_k through to each call
    return value.map(function (v) {
      return TAPResults(v, model_id, top_k);
    });
  } else {
    var response = callTap({value: value, model_id: model_id, top_k: top_k});
    var categories = [];

    for (var j=0; j < response.length; j++) {
      categories.push(response[j].label);
      categories.push(response[j].confidence);
    }

    return [categories];
  }
}

Once you have added the above snippet to Code.gs and saved the file, you can invoke the script in your spreadsheet using a formula like this:

=TAPResults(A1, "YOUR_MODEL_ID", 1)

Where A1 is the cell containing the text or URL you're looking to analyze, and "YOUR_MODEL_ID" is the ID of your TAP Model, which is the bit that follows https://api.tap.aylien.com/v1/models/ in your Model's endpoint (e.g. 3baf3bc0-c0bb-4503-bdd5-d22341ff209f). Please ensure that your Model is deployed before trying to call it from Google Sheets.

Resources

FAQs

What is the Text Analysis Platform (TAP)? The Text Analysis Platform is a simple way to build advanced Natural Language Processing tools. It allows you to build custom language models that you can use to generate the insights you need for the solution you have in mind. TAP allows you to easily build your own API that is customized for your use case.

Why should I consider using TAP if I am currently using Text or News API? Out-of-the-box APIs from any provider are not optimized for anyone in particular; they are designed with everyone in mind. You may find you fall into one of two categories:

  • You aren't getting the results you want from the API you are currently using, and you need a fine-tuned model for your solution.

  • There simply isn't an API that directly addresses the issue that you are facing, so you cannot properly leverage NLP to generate insights.

How does TAP work? TAP lets you build a custom model by training it on labeled data that is a good example of the data you want to use it on. During training, your new Model uses AYLIEN's Deep Learning-powered NLP to recognize patterns in your Dataset. When you are finished training your Model, you can deploy it into whatever solution you are building, and it will recognize the same patterns in new data.

How do I use TAP to create a custom NLP API? Creating a custom NLP API involves three steps. First, you gather sample data and label it with the labels you want your Model to assign. Second, you let TAP train your Model on this Dataset, and you repeat these first two steps until you are satisfied with your Model's accuracy. This involves tweaking your Dataset so your Model learns enough to accurately predict every label. Finally, you click Deploy and paste your Model's endpoint into your code. After this, you can make calls to your custom Model like any other API.

How do I sign up to start using TAP? Getting started with TAP is easy, and you can build and test a Model for free – you will only need to move to a paid plan once you deploy your Model.

What is the pricing for TAP? The pricing for TAP is subscription-based, and each subscription tier allows you to deploy a different number of Models at one time. You can build, test, and store as many models as you like for free. TAP is currently in Public Beta testing, so we are offering two subscription plans:

  • The Free plan allows you to deploy a Model at no cost.
  • The Starter plan (€49.99/month) allows you to deploy up to four Models at once, and also provides higher rate limits.

If you want to deploy more than four Models at once, get in touch with sales@aylien.com.

Glossary

API Key

An API Key is a string of 32 letters and numbers that allows our servers to recognize you and give you access to an API. As a TAP user, you will receive an API Key, which you insert into your code to access your Models. You can find your API Keys by clicking on your username in the top right corner of the TAP site and selecting the API Keys option from the dropdown menu.

Boolean Search

A Boolean Search is a search for more than one keyword or parameter at once. This search is carried out by using operators such as "AND," "NOT," and "OR" in the query. These operators broaden or refine the results of your search. AND and NOT queries allow you to refine your search by excluding concepts or making search terms more detailed, whereas OR queries broaden your search by including a wider range of results.

For example, this is how Boolean Search works in TAP's Knowledge Base:

  • "camera AND lens," = documents that contain both "camera" and "lens."
  • "camera NOT lens" = documents that contain the word "camera" but not the word "lens."
  • "camera OR lens" = documents that contain the word "camera" and also documents that contain the word "lens."

Classification

Text classification refers to how automated language models classify documents according to taxonomies they have been trained to recognize. This can involve classifying the subject matter of a news story or the sentiment of a review. When a model classifies a document, it is predicting what label to assign to text data, like assigning a "sport" label to a story that frequently mentions sports, or a "positive" label to a review that contains positive words. The model bases its predictions on what it has learned during training.

With TAP, users decide which labels to assign to data and train language models to understand the text they deal with, allowing endless applications. For example, instead of classifying news stories by their subject matter, a user can classify incoming communications and route them to the correct staff member.

Collections

Before you upload your CSV file to create your dataset and train your model, you can also define exactly which rows your model will train on and which rows it will be tested on. This can be done automatically, by splitting your data, or manually, by assigning each row to a "Training" or "Test" Collection.

  • Training: the "Training" label simply tells your model to train on these rows, and ignore the rest of the data.
  • Test: the "Test" label tells your Model which documents to ignore until it has been trained, at which point it will try to predict the labels that are attached to the documents in this test collection.

Confusion Matrix

A confusion matrix shows how accurate your model was at predicting each label across every category. For example, if you classified your training data with five labels: "a," "b," "c," "d," and "e," your model could classify each data point with any one of these five labels. After you train and test your model, the confusion matrix shows you how many documents that you had labelled as "a" that your model correctly labelled as "a," but also how many documents you had labelled as "b" that your model predicted as "a," and so on.

A confusion matrix is essentially a very detailed accuracy score, presented to you in a grid so that you can see all of the data at once in a clear way. Seeing this data will help you identify common mistakes your model is making.

CSV File

A Comma-Separated Values (CSV) file is a type of file where values (such as text or numbers) are separated by commas. This allows computers to easily arrange the data as rows, which we can then view as spreadsheets, and analyze large numbers of entries at once.

Dataset

In TAP, a dataset is a collection of text documents with a label attached to each, which users can train new language models on. TAP allows users to upload their own dataset in CSV format, or to create a dataset from text files or by using Knowledge Base.

Deploying a model

After you have trained your model, deploying it simply means moving it out of the Sandbox (where you build, train, and test your model) and into production (where you can access it from whatever solution you're building). Your model remains in the cloud; the only difference is that you can now make calls to your model just like any other API, using your API key.

Forking

In TAP, users can fork another user's Model or Dataset if it has been made public: this means that by clicking the Fork button, you are copying that Dataset to your own TAP account for your own use, without affecting the original.

F1

The F1 score combines the Precision and Recall scores (it is their harmonic mean). The reason we include this is that it's important to measure both at once – it's no good if your model is very precise but has very little recall. In other words, if your model makes only one prediction and that prediction is correct, it will have a perfect precision score of 1, but there could be hundreds of other data points with that label that it didn't predict, so the recall score will be terrible. The recall score, by contrast, measures how comprehensive your model's correct predictions are.

For example, in a sentiment analysis model, a precision score asks: "of all of the positive predictions, how many were actually positive?" A recall score asks: "of all of the documents that were positive, how many did my model identify as positive?"

The F1 score combines these two scores to give you a quick but accurate measurement of both of them in one number.

Hyperparameters

While training your model, the hyperparameters specify exactly how your model will be trained on the dataset you're training it on. They are the specific methods that your model will use to recognize patterns in the data, and TAP users can determine these methods in the Advanced Settings tab of the "Train your Model" page.

These can be quite advanced, so users can also select one of two preset hyperparameter settings, optimized for longer news content or for shorter social media content.

Iteration

Iteration refers to a design process that involves repeating certain steps to improve a design, usually based on evidence from testing. Instead of planning every stage of development at the beginning, an iterative process includes tests of designs, and allows for the changing of designs based on the results of these tests. This process works well when building a language model because it allows users to refine their model and train it to produce the most accurate predictions.

Knowledge Base

In TAP, our Knowledge Base is a database of hundreds of millions of documents that we have tagged with labels so that you can train your model on them. These documents are from different sources and enriched with different labels. They include recent news articles gathered by our News API, historical news articles from the New York Times, encyclopedia entries from Wikipedia, and product reviews from Amazon. The documents in each of these sources contain text generated by people, and we have labelled each piece of text with tags relevant to the text.

For example, there are over 80 million Amazon product reviews that contain the text of the review, and each review is labelled with tags like the rating the reviewer left (on a scale of one to five stars). From this, you can use reviews with one star to train a model to recognise negative sentiment, and reviews with five stars to train a model on positive sentiment. You can also search for documents about a specific subject to train a model to recognize text about that subject.

Label

When dealing with data, a label is a tag that either we, or our language model, assign to an individual piece of data. For example, a user can label a document about football with the tag "football," or a positive Tweet with the tag "Positive." During the training process, the model analyzes the data linked with each label, and after you deploy your model, it will assign one of these labels to new data, based on what it has learned.

Model

A custom model is a tool that a user can use to extract insights from textual content. TAP allows users to easily build their own models to extract exactly what is relevant to their requirements. In TAP, a user builds their model by training it on data that is pre-labelled with relevant tags and testing it to see how accurate it is at predicting these tags on new data. When the user is happy with the accuracy of the model, they can use it to label new text with the labels it has been trained to classify with. This allows users to automate the analysis of any type of document and extract information that they have told the model is relevant.

Model Status

  • Stopped (sandbox): This status message is reporting that your model has stopped.
  • Not Ready (sandbox): While your model is being trained, this status message is displayed. At this point, your model is using AYLIEN's technology to analyze the training data.
  • Running (sandbox): This status shows that your model is in the sandbox stage, but not currently being trained.
  • Running (production): After you have finished training your model and you have deployed it, this status is shown in green to display that your model is ready for real-world data.

Node

A node is like a folder that contains data of a certain type. All of the data files attached to a node contain labels that define that node, just as files in a folder marked "football" will contain documents about football. And just as a subfolder can exist within another folder, a node can be linked to another node within a structured relationship.

Parameters

Parameters determine the limits of a query or investigation. For example, if you search for "Donald Trump" in our News API, it will return every story about Donald Trump. However, if you set the language parameter to "language[]=English", it will ignore every article that is in a language other than English. This is because parameters define the boundaries within which the search is carried out.

Precision

When testing your model, the precision score shows how accurate the predictions made by your model were. More specifically, it measures how many of the predictions that it made were correct.

Imagine, for example, we are testing our model on 1,000 Tweets we know are positive and 1,000 Tweets we know are negative. Most of the time, our model will classify positive Tweets as "Positive" and negative Tweets as "Negative." But it will also make mistakes, sometimes classifying positive Tweets as "Negative" and vice versa. The precision score measures, out of all the predictions of "Positive," how many were actually positive. In other words, it measures how precise the predictions it made were.

The precision is measured on a scale of 0 to 1, with 0 being completely inaccurate and 1 being totally accurate.

Recall

When testing your model, the recall score shows how comprehensive your accurate predictions were. Again, imagine we have 1,000 Tweets we know are positive and 1,000 Tweets we know are negative. Most of the time, our language model will classify positive Tweets as "Positive" and negative Tweets as "Negative." But it will also make mistakes, sometimes classifying some positive Tweets as "Negative" and negative Tweets as "Positive." The recall score measures, out of all of the Tweets we know are positive, how many did the model classify as "Positive"? And how many did it miss? In other words, of all the positive Tweets, how many did the model recall as "Positive?"

On a scale of 0 to 1, a score of 0.2 would indicate that the model only classified 20% of the positive Tweets as "Positive," whereas a score of 0.9 would indicate that of all the positive Tweets, our classifier classified 90% of them as "Positive."

Running Jobs

The Running Jobs page on TAP shows you models that are currently being trained on AYLIEN's remote storage. At this point, your model is analyzing the training data that you supplied to it. While any model is being trained, there will be a small notification at the top right of the screen. When the training job is finished, you can access your model on the My Models page.

Sandbox

Sandbox refers to the section of the TAP site where users can develop Models without deploying them outside of that section of the site. This means that any bugs or errors in the software will not affect anything outside the sandbox, and that no data or inputs from outside the sandbox can interfere with the development of the software.

Sentiment Analysis

Sentiment Analysis models predict the sentiment of pieces of text by subjectivity (whether a piece of text is subjective or objective) and polarity (if it is subjective, whether it expresses positive, negative, or neutral sentiment). TAP allows users to customize sentiment analysis models for their own domains by training them on data as similar as possible to the data they will analyze.

Splitting

When you split your dataset, you are allocating a certain percentage of your data to train your model, and holding back the remainder so you can test the model after it is trained. This allows you to test your model on data that is as similar to the data it was trained on as possible, meaning your test will give the clearest possible picture of how accurate your model is.

For example, if you split your data into 90% training and 10% test data, your model will be trained on a randomly selected 90% of your data to learn the relationship between documents and labels, and it will then try to predict the labels of the remaining 10%. You can then review how accurately your model predicted the examples in this 10%.

Training Data

Training data is simply data you have gathered that you are going to train your model on. This data is labelled – for example, a Tweet that has a positive tone will be labelled "Positive" and a Tweet with a negative tone will be tagged as "Negative." This data can be stored in a CSV file, so it looks like a spreadsheet where one cell will have a positive Tweet and the cell next to it will say "Positive." This allows your model to analyze the Tweets and understand what positive and negative Tweets look like. When it is finished training, you can show it unclassified Tweets and it will tell you whether they are positive or negative.

In TAP, you can bring your own training data that you have labelled yourself or you can use our Knowledge Base to select data that has been labelled for you.