Ken North

Dr. Dobb's Bloggers

Information Storage and Retrieval: From MEDLARS to Twitter

March 01, 2012

Using the MeSH browser, you can enter search terms or go to a page where you can browse the tree structure. If you enter the term "Heart", you'll find an extensive list of related subjects. If, for example, you click on "Heart Failure" (C14.280.434), you'll find a page that lists entry terms related to heart failure, including cardiac failure, congestive heart failure, left-sided heart failure, and myocardial failure. If you examine node C14.280.434, you'll see sub-nodes such as Edema, Cardiac (C14.280.434.482), Heart Failure, Diastolic (C14.280.434.611), Heart Failure, Systolic (C14.280.434.676), and more. A term such as "heart failure" can carry a unique term identifier, a concept identifier, a semantic type, a preferred term, a preferred concept, an alternative term, an alternative concept, a lexical tag, and the ID of the appropriate NLM thesaurus.

Keep in mind that this is not how you query the decades of bibliographic citations themselves. It's how you navigate the subject headings of the vocabulary used to interrogate PubMed, the database of medical literature at NLM.
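The dotted tree numbers encode the hierarchy directly: dropping the last segment of a tree number gives its parent node. Here is a minimal Python sketch of that navigation, using only the tree numbers quoted above. The `MESH_NODES` dictionary and the `parent` and `children` helpers are hypothetical names for illustration; a real application would load the vocabulary from NLM's MeSH data files.

```python
# Tree numbers and headings quoted in the text; everything else here
# (the dictionary and helper names) is illustrative, not an NLM API.
MESH_NODES = {
    "C14.280.434": "Heart Failure",
    "C14.280.434.482": "Edema, Cardiac",
    "C14.280.434.611": "Heart Failure, Diastolic",
}


def parent(tree_number):
    """Return the parent tree number, or None for a top-level node."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def children(tree_number, nodes):
    """Return the direct sub-nodes of tree_number, sorted by tree number."""
    return sorted(t for t in nodes if parent(t) == tree_number)


if __name__ == "__main__":
    for tree_number in children("C14.280.434", MESH_NODES):
        print(tree_number, MESH_NODES[tree_number])
```

Because the hierarchy lives in the identifier itself, finding the ancestors or descendants of a heading needs only string operations on tree numbers, with no separate parent/child table.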

Twitter

Twitter arrived on the information storage and retrieval scene roughly four decades after the original MEDLARS went online. The cost of computing when each system was launched had a lot to do with the size of the audience it could serve. The MEDLARS database originally served a closed community of healthcare professionals, before inexpensive computers and low-cost bandwidth permitted it to serve a web audience. Twitter's charter is to build as large a user community as possible, sustaining as much timely content as possible in the process.

Twitter supports a REST API, exposing GET and POST methods for working with data about tweets, timelines, searches, direct messages, friends, and followers. It also has APIs for processing data about users, suggested users, favorites, lists, accounts, saved searches, trends, locations, help, authentication, and spam reporting.
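To make the REST style concrete, here is a minimal Python sketch that builds (but does not send) a GET request for a timeline resource. The base URL, the `statuses/user_timeline` resource, and the `screen_name` and `count` parameters reflect the version 1 API as I understand it at the time of writing; treat them as assumptions and check the current Twitter documentation before relying on them. The `build_get` helper is a hypothetical name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL for the version 1 REST API; verify against current docs.
API_BASE = "https://api.twitter.com/1"


def build_get(resource, **params):
    """Build a GET request for a JSON resource with query parameters.

    Parameters are sorted so the resulting URL is deterministic.
    """
    url = f"{API_BASE}/{resource}.json"
    query = urlencode(sorted(params.items()))
    if query:
        url += "?" + query
    return Request(url, method="GET")


if __name__ == "__main__":
    req = build_get("statuses/user_timeline", screen_name="example", count=5)
    print(req.get_method(), req.full_url)
```

Methods that change state, such as posting a tweet, would use POST with authentication, which this sketch deliberately leaves out.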

Tweets are essentially a bucket into which we can pour free-form text and entities such as lists, hashtags, URLs, and media. Twitter users often find value in classifying the content of their tweets by including one or more hashtags, which are searchable entities. Media entities include attributes such as id, media_url, indices, sizes, and type.
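As an illustration of how entities accompany the text, the following Python sketch pulls hashtags and media out of a tweet represented as a dictionary. The sample tweet is invented for this example; the layout (an `entities` object holding `hashtags` and `media` lists, each entry carrying `indices` into the text) follows the entity attributes named above, but the exact shape should be treated as an assumption. The helper names are hypothetical.

```python
# Invented sample tweet; field values are illustrative only.
SAMPLE_TWEET = {
    "text": "Reading about #MEDLARS and #MeSH http://t.co/example",
    "entities": {
        "hashtags": [
            {"text": "MEDLARS", "indices": [14, 22]},
            {"text": "MeSH", "indices": [27, 32]},
        ],
        "media": [
            {
                "id": 1,
                "media_url": "http://example.com/photo.jpg",
                "indices": [33, 52],
                "sizes": {"small": {"w": 340, "h": 226, "resize": "fit"}},
                "type": "photo",
            }
        ],
    },
}


def hashtags(tweet):
    """Return the hashtag strings (without '#') attached to a tweet."""
    return [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]


def media_summary(tweet):
    """Return (id, type, media_url) for each media entity in a tweet."""
    media = tweet.get("entities", {}).get("media", [])
    return [(m["id"], m["type"], m["media_url"]) for m in media]


def entity_text(tweet, entity):
    """Return the slice of the tweet text that an entity's indices cover."""
    start, end = entity["indices"]
    return tweet["text"][start:end]


if __name__ == "__main__":
    print(hashtags(SAMPLE_TWEET))
    print(media_summary(SAMPLE_TWEET))
```

The `indices` pair is what ties each entity back to its position in the free-form text, so a client can highlight or link a hashtag without parsing the text itself.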

Twitter does not currently create a historical archive of tweets for searching by the user community. Creating an archive today would require writing an app that uses the Streaming API to get tweets in real time and store them. The Twitter Search API provides a search of the real-time index of tweets. It's a vehicle for searching tweets from the past 6-9 days. Search supports a since_id parameter, which takes a tweet ID rather than a date: it restricts results to tweets with an ID greater than (that is, more recent than) the one specified.
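One way to picture incremental archiving is a loop that remembers the highest tweet ID already stored and asks only for newer tweets, since since_id is a tweet ID and results have IDs greater than it. The Python sketch below uses polling rather than the Streaming API, and it runs against an in-memory stand-in instead of the network. Both `make_fake_search` and `poll_once` are hypothetical helpers, not part of any Twitter library.

```python
def make_fake_search(tweets):
    """Hypothetical stand-in for a search call.

    Returns a function that yields only tweets whose id is greater than
    since_id, mimicking how a since_id filter behaves.
    """
    def fetch(since_id=None):
        return [t for t in tweets if since_id is None or t["id"] > since_id]
    return fetch


def poll_once(fetch, archive, since_id=None):
    """Fetch tweets newer than since_id, append them to the archive in
    id order, and return the new high-water mark."""
    batch = sorted(fetch(since_id), key=lambda t: t["id"])
    archive.extend(batch)
    return batch[-1]["id"] if batch else since_id


if __name__ == "__main__":
    index = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
    fetch = make_fake_search(index)
    archive = []
    mark = poll_once(fetch, archive)        # stores ids 1 and 2
    index.append({"id": 3, "text": "third"})
    mark = poll_once(fetch, archive, mark)  # stores only id 3
    print(mark, [t["id"] for t in archive])
```

A real archiver would persist both the archive and the high-water mark, and would have to poll often enough that nothing ages out of the search index between requests.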

The MEDLARS example illustrates an information storage and retrieval system designed to archive information for decades of searching. Twitter is an example of a system intended to handle a huge daily volume of information, with scalability, polyglot persistence, and restricted search and archiving capabilities. For tweets, analytics, and other data, Twitter uses multiple data storage and retrieval solutions, including Cassandra, Hadoop, Hive, Pig, Vertica, and MySQL. As of December 2011, Twitter was storing 250 million tweets per day with a data store built using MySQL. It's an example of an information storage and retrieval system that's responsive to an enormous user community because it scales out to handle Big Data volumes. Hashtags simplify searching, but the data available to users has a short shelf life.
