Intern, Linguistic Data Engineering Team - Summer 2018

Basis Technology, Cambridge, MA

Data is messy. We make sense of it.

About the Job

Basis Technology is seeking a Linguistic Data Engineering Intern to join a growing data team that supports several text analytics products. This person will work with multiple separate engineering teams, providing quality data to evaluate and advance the development of natural language processing tools, and will consult on the language-specific aspects of multilingual text.

Responsibilities

  • Assist with managing large-scale text mining, data acquisition, and annotation projects
  • Derive meaningful metrics from data annotation tasks
  • Describe and demonstrate linguistic phenomena in a variety of languages
  • Survey and catalogue new data releases and best practices in data maintenance, conversion, and analytics

Requirements

  • Strong scripting abilities, especially Python
  • Ability to write and revise annotation guidelines
  • Knowledge of linguistics, including
    • tokenization
    • parts of speech
    • morphology
    • grammar structures
  • Familiarity with linguistic community resources
    • especially treebanks but also
    • CredBank, ClueWeb, CommonCrawl, and other AWS-hosted sets
  • Experience working with and modifying annotation tools such as WebAnno, brat, GATE
  • Nice to have:
    • Experience working with crowdsourcing platforms such as Mechanical Turk and CrowdFlower
    • Experience with finite-state automata, especially Xerox FST
    • Proficiency in at least one language in addition to English
    • Experience with conversion, storage, version control, and maintenance tasks for large multilingual text collections

About Basis Technology

We’re the leading provider of software solutions for extracting meaningful intelligence from multilingual text and digital devices. Our mission is to improve the process of extracting meaningful intelligence from unstructured multilingual text and digital devices by developing the industry’s best software. Since our founding in 1995, our products and services have been used by over two hundred major firms, including Amazon.com, EMC, Endeca/Oracle, Exalead/Dassault, Fujitsu, Google, Hewlett-Packard, Microsoft, Oracle, and governments around the world. Our language analysis and digital forensics software is widely used in the U.S. defense and intelligence industry by firms such as BBN, CACI, Lockheed Martin, MITRE, Northrop Grumman, and SAIC. We’re also the top provider of Asian linguistic technology to web search engines, including Ask.com, Google, Microsoft Bing, and Yahoo!. Through these relationships, we’ve developed a reputation for deep expertise in computational linguistics and digital forensics, an uncompromising commitment to providing quality software and services, and dedication to serving our customers’ needs with unparalleled support.

Our Rosette linguistics platform uses state-of-the-art natural language processing techniques to improve information retrieval, text mining, machine learning, statistics, and computational linguistics. Rosette provides capabilities such as identifying the language of incoming text, providing a normalized representation in Unicode, and locating names, places, and other key concepts in a body of unstructured text. Rosette is the world’s most widely used family of commercial software products for multilingual information retrieval. Its reliability, scalability, accuracy, and strict compliance with industry and international standards have been put to the test in high-volume transaction environments, such as Google’s multilingual search engine, PeopleSoft’s human capital management software, and Amazon.com’s global e-commerce system.

Our digital forensics group provides investigative services and software to make examiners more efficient. The group develops solutions that focus on speed, ease of use, and extensibility for incident responders, corporate investigators, law enforcement, and the military. It is the primary contributor to several open source projects, including Autopsy and The Sleuth Kit, for which it provides commercial add-ons, training, and support.

Want to learn more about Basis Technology? Visit Basis Technology's website.