Clustering engine for millions of documents and gigabytes of text

2d zoomable map of PubMed articles on vaccination produced by Lingo4G large-scale text clustering engine.

Lingo4G is a software component you can use to implement interactive exploration of millions of documents spanning gigabytes of text.

Lingo4G Explorer, pictured in the screen shot, shows what you can build with the Lingo4G REST API. Lingo4G Explorer is also a great tool for quick experiments with Lingo4G-powered large-scale text clustering.

The screen shot shows a 2d map visualization of 3.5k medical paper abstracts related to vaccines. Textually similar abstracts are clustered together on the map. The panel on the right shows the top abstracts lying in the map region described as "measles vaccine". Phrases specific to the selected map region are highlighted in the text of the abstract.

Note: analysis presented in the screen shot took 2.76s to compute on a modern 16-core workstation with SSD storage. Analysis times will vary depending on the parameters of your hardware.

Meaningful insights from large quantities of text

Lingo4G enables interactive exploration of millions of documents and gigabytes of text in near-real-time, fully automatically and without external knowledge bases.

Bird's eye view

Get an overview of the topics covered in thousands of documents, within seconds.

In-depth exploration

Quickly identify documents of interest and visualize relationships between them.

Engaging visualizations

Combine topics, clusters and 2d document maps into powerful visualizations.

A graph of gene-related topics extracted from 350k PubMed abstracts by Lingo4G large-scale text clustering engine.

Lingo4G can extract the topics discussed in hundreds of thousands of documents within seconds.

The screen shot shows subtopics of the "gene" topic identified in nearly 350k abstracts of medical articles. The graph presents lexical relationships between subtopics, while the content view on the right side shows topical phrases in context by highlighting them in the text of the analyzed documents.

Note: analysis presented in the screen shot took 2.76s to compute on a modern 16-core workstation with SSD storage. Analysis times will vary depending on the parameters of your hardware.

Near-real-time topic discovery

Lingo4G can extract the topics discussed in hundreds of thousands of documents, along with lexical relationships between them, within seconds.

Topical phrases in context

Lingo4G can highlight selected topical phrases in the document text to put them in context and bring up the relevant parts.

Large-scale processing

Lingo4G can arrange hundreds of thousands of documents into non-overlapping clusters and 2d maps to help plan, execute and refine research.

Data slicing and filtering

Choose the document subset to analyze by typing a query, picking an area from the document map or selecting a topic or cluster to drill down on.

An interactive map of 128k SuperUser.com questions produced by Lingo4G large-scale text clustering engine.

On modern hardware, Lingo4G can generate document clustering and map visualizations for hundreds of thousands of documents within minutes.

The screen shot shows the map of 128k SuperUser.com questions. Each dot represents one document; colors correspond to top-level document clusters.

Different thematic areas are clearly visible: Excel-related questions in the top-right corner, network issues in the bottom-left corner, posts about web browsers in the middle. Smaller outlier groups, such as questions about disk drives (RAID, SATA), are also highlighted.

Note: analysis presented in the screen shot took about 1 minute to compute on a modern 16-core workstation with SSD storage. Analysis times will vary depending on the parameters of your hardware.

Treemap of document clusters produced by Lingo4G large-scale text clustering engine and FoamTree.

Lingo4G can apply analyses to an arbitrary subset of your collection. The screen shot shows 11k SuperUser.com posts containing the word "office" divided into non-overlapping hierarchical clusters, presented using the Carrot Search FoamTree treemap visualization component.

Various analysis parameters, such as the number of clusters, can be changed at runtime. The screen shot shows the analysis subset query parameter along with a fraction of document clustering parameters.

Note: analysis presented in the screen shot took 5.62s to compute on a modern 16-core workstation with SSD storage. Analysis times will vary depending on the parameters of your hardware.

Extensive tuning

Lingo4G exposes fine-grained parameters for adjusting the number of topics and clusters, editing stop lists to exclude unwanted topic labels and more.

Interactive exploration

Combine the Lingo4G JSON-based REST API with visualization components, such as FoamTree or Circles, to build interactive text exploration tools.

Fast, automatic, easy to integrate

Near-real-time processing

Once Lingo4G indexes your collection, it can extract topics, themes and document clusters within seconds.

Scalability

Topic discovery takes seconds regardless of whether you're processing a hundred or hundreds of thousands of documents.

No external taxonomies

Lingo4G processes documents based only on their textual content; no external dictionaries or taxonomies are required.

Stop word discovery

Lingo4G will automatically identify the meaningless phrases specific to your data, such as "present invention" for patent data.

Full text search

Should you need the good old full text search over your collection, Lingo4G can do that too.

Tuning

The Lingo4G Explorer application will let you get started quickly and tune every aspect of topic extraction and clustering.

Fast indexing

Lingo4G can index 200–2000 MB of text per minute. Adding, updating or deleting documents does not require reindexing.

Easy integration

Lingo4G exposes a JSON-based REST API you can call from any programming language to get analysis results.

Custom applications

Use the REST API to build more complex apps, such as finding textually similar documents or nearest-neighbor classification.
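As a rough illustration of what calling a JSON-based REST API looks like from client code, the sketch below builds a JSON POST request using only the Python standard library. The URL and the `query` field are placeholders made up for this example, not the actual Lingo4G endpoint or request format; take the real ones from the Lingo4G manual.

```python
import json
import urllib.request

# Hypothetical values for illustration only: neither the URL nor the
# "query" field is taken from the Lingo4G documentation.
ANALYSIS_URL = "http://localhost:8080/hypothetical/analysis"


def build_json_request(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response into a dict."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; call send(request)
# only once the URL points at a real service.
request = build_json_request(ANALYSIS_URL, {"query": "office"})
print(request.get_method(), request.full_url)
```

The same pattern carries over to any language with an HTTP client and a JSON library, which is what makes a JSON-based REST API easy to integrate.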

Questions & Answers

What are the applications of Lingo4G?

The natural use case is exploration of large volumes of human-readable text, such as scientific papers, business or legal documents.

Out of the box, Lingo4G can give an instant overview of the topics discussed in the whole collection or in a requested subset of it, and thus help analysts plan, execute and report on their research.

You can use the Lingo4G REST API to build more complex applications, such as recommendation of content-wise similar documents or nearest-neighbor classification.

You can also combine the Lingo4G REST API with visualization components, such as FoamTree, to build interactive text exploration applications.
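To make the idea of nearest-neighbor classification concrete, here is a minimal, self-contained sketch. It does not use Lingo4G or reflect how Lingo4G computes similarity; as a stand-in, it assumes a plain bag-of-words cosine similarity and a tiny made-up set of labeled documents. In a real application, the similarity step would presumably come from the engine's own similar-document results.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, labeled: list[tuple[str, str]]) -> str:
    """Return the label of the most similar labeled document (1-nearest-neighbor)."""
    query = vectorize(text)
    _, best_label = max(labeled, key=lambda pair: cosine(query, vectorize(pair[0])))
    return best_label


# Tiny made-up training set, for illustration only.
labeled_docs = [
    ("measles vaccine schedule for children", "medicine"),
    ("excel formula returns wrong sum", "software"),
    ("raid array fails after sata drive swap", "hardware"),
]

print(classify("which vaccine protects against measles", labeled_docs))
```

A new document simply inherits the label of its closest labeled neighbor, so the quality of the classification depends entirely on the quality of the similarity measure.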

What is the largest collection Lingo4G can handle?

On modern hardware with a high-core-count CPU and fast SSD storage, Lingo4G can handle collections reaching hundreds of gigabytes or a terabyte of text.

If you'd like to test Lingo4G on such a large data set, Lingo4G comes with built-in support for indexing patent grant and application documents available from the US Patent and Trademark Office. The collection is currently about 500 GB of text.

One important factor to consider is that currently Lingo4G does not offer distributed processing. This means that the maximum reasonable size of the project will be limited by the amount of RAM, disk space and processing power available on a single virtual or physical server.
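For a back-of-the-envelope feel of what indexing such a collection involves, the sketch below combines the 500 GB collection size with the indexing throughput of 200–2000 MB of text per minute quoted earlier on this page. It assumes decimal units (1 GB = 1000 MB) and that throughput stays constant for the whole run, which real hardware may not deliver, so treat the result as an order-of-magnitude estimate only.

```python
def indexing_minutes(collection_mb: float, mb_per_minute: float) -> float:
    """Estimated indexing time in minutes at a constant throughput."""
    return collection_mb / mb_per_minute


COLLECTION_MB = 500 * 1000           # about 500 GB of text, decimal units assumed
SLOW_MB_PER_MIN = 200                # lower end of the quoted throughput range
FAST_MB_PER_MIN = 2000               # upper end of the quoted throughput range

best = indexing_minutes(COLLECTION_MB, FAST_MB_PER_MIN)    # 250 minutes
worst = indexing_minutes(COLLECTION_MB, SLOW_MB_PER_MIN)   # 2500 minutes

print(f"{best / 60:.1f} to {worst / 60:.1f} hours")  # prints "4.2 to 41.7 hours"
```

Under these assumptions, indexing the full collection on a single server would take somewhere between a few hours and nearly two days.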

Which languages does Lingo4G support?

Currently, Lingo4G can only process English text. If you'd like to apply Lingo4G to content written in a different language, please contact us.

What are the system requirements for Lingo4G?

Lingo4G can run on any platform supporting Java 1.8 or later. While processing cannot currently be distributed to multiple machines, a high-end workstation with fast SSD storage should be capable of handling collections of several tens of gigabytes. For most data sets not exceeding gigabytes, any computer with 4GB of memory and some disk space will be sufficient. We strongly recommend using SSD drives to store Lingo4G indices. Please see the Requirements section of the Lingo4G manual for more details.

How is Lingo4G licensed?

We require one Lingo4G license per physical or virtual server that runs Lingo4G binaries, regardless of the number of cores on the server, the number of users and the number of collections handled by the server.

For large-scale or non-typical deployment scenarios, such as OEM distribution, please get in touch.

How many collections can I process on one server?

There are no restrictions on the number of Lingo4G instances running on one physical or virtual server. The only limit may be the capacity of the server, including RAM size, disk space and the number of CPUs.

What is the cost of a Lingo4G license?

The cost of a license depends on the edition. Please contact us for a quote.

Can I get a trial license?

Absolutely! Please get in touch for a free evaluation package.

I have a Lingo3G license, will I receive Lingo4G as an upgrade?

No. Lingo3G and Lingo4G are two separate products we intend to offer and maintain independently. Lingo3G will remain an engine for real-time clustering of small and medium collections, while Lingo4G will address clustering of large data sets. Therefore, Lingo4G is not an upgrade to Lingo3G, but a complementary offering.

Having said that, if you would like to switch from Lingo3G to Lingo4G, we offer a license trade-in option and count the initial Lingo3G license purchase fee towards the Lingo4G license fee.

Can I use the dotAtlas map visualization component in my application?

The dotAtlas map visualization component shipping with Lingo4G Explorer is currently pre-release software. It's been battle-tested for months by early adopters, but lacks a finalized API and documentation.

If you'd like to try integrating dotAtlas into your software, please let us know. We'll be happy to share the pre-release version along with code examples and initial guidance.

We will not charge any extra fees for the pre-release versions of dotAtlas. Once it enters the official product suite, the use of dotAtlas will require a license fee similar to the one that applies to Carrot Search FoamTree.

Next steps