Information Storage and Retrieval: From MEDLARS to Twitter
Using the MeSH browser, you can enter search terms or go to a page that lets you browse the tree structure. If you enter the term "Heart", you'll find an extensive list of related subjects. If, for example, you click on "Heart Failure" (C14.280.434), you'll find a page that shows a number of entry terms related to heart failure, including cardiac failure, congestive heart failure, left-sided heart failure, myocardial failure, and so on. If you examine node C14.280.434, you'll see sub-nodes such as Edema, Cardiac (C14.280.434.482), Heart Failure, Diastolic (C14.280.434.611), Heart Failure, Systolic (C14.280.434.676), and more. A term such as "heart failure" can carry a unique term identifier, a concept identifier, a semantic type, a preferred term, a preferred concept, an alternative term, an alternative concept, a lexical tag, and the ID of the appropriate NLM thesaurus.
Keep in mind this is not how you run the actual queries against decades of bibliographic citations. It's how you navigate the subject headings of the vocabulary used to interrogate PubMed, the database of medical literature at NLM.
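The dotted tree numbers in the Heart Failure example encode the hierarchy directly: dropping the last segment of a number gives its parent node. The following is a minimal sketch of that idea, assuming a small hand-copied sample of tree numbers from the text rather than a real MeSH export. The names MESH_SAMPLE, parent_of, and children_of are hypothetical helpers and are not part of any NLM tool.

```python
# Hand-copied sample of tree numbers from the Heart Failure example.
# This is illustrative data only, not a complete or official MeSH export.
MESH_SAMPLE = {
    "C14.280.434": "Heart Failure",
    "C14.280.434.482": "Edema, Cardiac",
    "C14.280.434.611": "Heart Failure, Diastolic",
}


def parent_of(tree_number):
    """Return the parent tree number, or None for a top-level node."""
    head, sep, _tail = tree_number.rpartition(".")
    return head if sep else None


def children_of(node, tree):
    """Return the direct sub-nodes of `node` present in `tree`, sorted."""
    return sorted(number for number in tree if parent_of(number) == node)


if __name__ == "__main__":
    # List the sub-nodes of Heart Failure found in the sample.
    for number in children_of("C14.280.434", MESH_SAMPLE):
        print(number, MESH_SAMPLE[number])
```

Because the hierarchy lives in the identifier itself, browsing up or down the tree needs only string operations on the tree number, which is one reason the vocabulary is easy to navigate.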
Twitter arrived on the information storage and retrieval scene five decades after the original MEDLARS went online. The cost of computing when each system was launched had a lot to do with the size of the audience it could serve. The MEDLARS database originally served a closed community of healthcare professionals, before inexpensive computers and low-cost bandwidth permitted it to serve a web audience. Twitter's charter is to build as large a user community as possible, sustaining as much timely content as possible in the process.
Twitter supports a REST API, exposing GET and POST methods for working with data about tweets, timelines, searches, direct messages, friends, and followers. It also has APIs for processing data about users, suggested users, favorites, lists, accounts, saved searches, trends, locations, help, authentication, and spam reporting.
Tweets are essentially a bucket into which we can pour free-form text and entities such as lists, hash tags, URLs, and media. Twitter users often find value in classifying the content of their tweets by including one or more hash tags, which are searchable entities. Media entities include attributes such as id, media_url, indices, and sizes.
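To make the entity idea concrete, here is a minimal sketch that reads hash tags and media entities out of a tweet represented as a Python dictionary. The sample tweet is hand-written for illustration and is not a real API response. Its layout (an "entities" key holding "hashtags" and "media" lists, each item carrying "indices" into the tweet text) is an assumption modeled on the attributes named above, and the helper names are hypothetical.

```python
# Hand-written sample tweet; the values and URLs are made up for illustration.
SAMPLE_TWEET = {
    "text": "Reading about #MEDLARS and #PubMed http://t.co/abc",
    "entities": {
        "hashtags": [
            {"text": "MEDLARS", "indices": [14, 22]},
            {"text": "PubMed", "indices": [27, 34]},
        ],
        "media": [
            {
                "id": 1,
                "media_url": "http://example.com/photo.jpg",
                "indices": [35, 50],
                "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}},
            }
        ],
    },
}


def hashtag_texts(tweet):
    """Return the hash tag strings (without '#') attached to a tweet."""
    return [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]


def media_urls(tweet):
    """Return the media_url of every media entity attached to a tweet."""
    return [item["media_url"] for item in tweet.get("entities", {}).get("media", [])]


def entity_span(text, entity):
    """Return the slice of the tweet text that an entity's indices point at."""
    start, end = entity["indices"]
    return text[start:end]


if __name__ == "__main__":
    print(hashtag_texts(SAMPLE_TWEET))
    print(media_urls(SAMPLE_TWEET))
```

The indices attribute is what ties an entity back to the free-form text: it marks where in the tweet the hash tag or link appears, so a client can highlight or replace it without re-parsing the text.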
Twitter does not currently create a historical archive of tweets for searching by the user community. Creating an archive today would require writing an app that uses the Streaming API to get tweets in real time and store them. The Twitter Search API searches the real-time index of tweets, which covers roughly the past 6-9 days. Search supports a since_id parameter that returns only tweets with an ID greater than the given tweet ID, so it sets a starting point for a search rather than a literal start date.
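A client that polls the search index repeatedly can use since_id to avoid fetching the same tweets twice. Since since_id takes a tweet ID, the client remembers the highest ID it has seen and passes that value on the next request. The sketch below only builds the request URL and tracks the ID; it makes no network call. The endpoint in SEARCH_URL is an assumption that may not match the current Twitter API, and build_search_url and newest_id are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; check the current API documentation.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, since_id=None):
    """Build a search URL, optionally limited to tweets newer than since_id."""
    params = {"q": query}
    if since_id is not None:
        params["since_id"] = since_id
    return SEARCH_URL + "?" + urlencode(params)


def newest_id(statuses, previous=None):
    """Return the highest tweet ID seen so far, to use as the next since_id."""
    ids = [status["id"] for status in statuses]
    if previous is not None:
        ids.append(previous)
    return max(ids) if ids else None


if __name__ == "__main__":
    # Pretend an earlier poll returned these two tweets (made-up IDs).
    last_seen = newest_id([{"id": 1001}, {"id": 1005}])
    print(build_search_url("#MeSH", since_id=last_seen))
```

An archiving app of the kind described above would store each batch of results and carry newest_id forward between polls, though a real client would also need authentication and rate-limit handling, which this sketch leaves out.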
The MEDLARS example illustrates an information storage and retrieval system designed to archive information for decades of searching. Twitter is an example of a system intended to handle a huge daily volume of information, with scalability, polyglot persistence, and restricted search and archiving capabilities. For tweets, analytics, and other data, Twitter uses multiple data storage and retrieval solutions, including Cassandra, Hadoop, Hive, Pig, Vertica, and MySQL. As of December 2011, Twitter was storing 250 million tweets per day with a data store built using MySQL. Twitter is thus an example of an information storage and retrieval system that's responsive to an enormous user community because it scales out to handle Big Data volumes. Hash tags simplify searching, but the data available to users has a short shelf life.