REAL-TIME TEXT ANALYSIS WITH CHUNKERS

SEM site scraping code

Project description

This project explores the use of chunkers to analyze text data recorded in real time. Such recordings capture cursor positions, typed or deleted characters, spaces, and the final texts, and they pose challenges of their own. Chunking, a linguistic analysis technique that groups words into larger units such as noun phrases or verb phrases, is applied to these data to assess its effectiveness and limitations in real-time scenarios.
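To make the shape of such a recording concrete, here is a minimal sketch of how a log of typing events could be replayed into a final text. The `Event` record, its field names, and the `replay` function are hypothetical and chosen for illustration only; the project's actual recording format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One recorded keystroke (hypothetical format, assumed for this sketch)."""
    position: int   # cursor position at which the event applies
    kind: str       # "insert" or "delete"
    char: str = ""  # character typed; empty for deletions


def replay(events: List[Event]) -> str:
    """Apply the events in order to an empty buffer and return the final text."""
    buffer: List[str] = []
    for event in events:
        if event.kind == "insert":
            buffer.insert(event.position, event.char)
        elif event.kind == "delete":
            # Ignore deletions that point outside the current buffer.
            if 0 <= event.position < len(buffer):
                del buffer[event.position]
        else:
            raise ValueError(f"unknown event kind: {event.kind!r}")
    return "".join(buffer)


# The writer types "the k", deletes the "k", then types "cat".
events = [
    Event(0, "insert", "t"), Event(1, "insert", "h"), Event(2, "insert", "e"),
    Event(3, "insert", " "), Event(4, "insert", "k"), Event(4, "delete"),
    Event(4, "insert", "c"), Event(5, "insert", "a"), Event(6, "insert", "t"),
]
final_text = replay(events)
```

Replaying the whole log gives the final text, while replaying only a prefix of it gives the text as it stood at that moment of writing.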

The project aimed to assess how well chunkers can handle real-time text data and to identify any associated challenges and opportunities. We began by formulating research questions concerning the ability of chunkers to segment text accurately and the relationship between pauses and word groups. The corpus used was recorded in three stages of the writing process: planning, formulation, and revision.

Several chunkers, including SEM, TreeTagger, spaCy, and NLTK, were evaluated on their recall, precision, and F-measure scores. Data preparation involved reconstructing the texts from the recordings and creating bigram phrases for the chunkers to analyze. To gather chunking results from SEM, we used Selenium to automate scraping of its website, simulating human navigation.
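As an illustration of the three scores, the sketch below compares a chunker's output with a reference annotation. It assumes that a chunk is represented as a (start token, end token, label) triple and that only exact matches count as correct; both are assumptions made for this example, and the project's own scoring may be defined differently.

```python
from typing import Iterable, Set, Tuple

# (start token index, end token index exclusive, label); assumed representation
Chunk = Tuple[int, int, str]


def chunk_scores(gold: Iterable[Chunk],
                 predicted: Iterable[Chunk]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) using exact chunk matches."""
    gold_set: Set[Chunk] = set(gold)
    pred_set: Set[Chunk] = set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Reference: [the cat] [sat] [on the mat]
gold = [(0, 2, "NP"), (2, 3, "VP"), (3, 6, "PP")]
# Chunker output splits the last group into two chunks.
predicted = [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "PP"), (4, 6, "NP")]

precision, recall, f_measure = chunk_scores(gold, predicted)
```

Here two of the four predicted chunks match the reference, so precision is 2/4 and recall is 2/3. The F-measure is the harmonic mean of the two.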

Analysis of the data revealed that pauses frequently occur at chunk boundaries, particularly between prepositional and nominal groups, highlighting the role of pauses in text segmentation. The results emphasize the effectiveness of SEM for analyzing real-time text data and suggest that while chunkers can segment data accurately, they also face specific challenges that can be addressed in future research.
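One simple way to quantify this kind of observation is to measure the share of pauses that coincide with a chunk boundary. The sketch below assumes that pauses have already been detected and mapped to character offsets in the final text, and that chunks are given as (start, end) character spans. The function name and both input formats are hypothetical.

```python
from typing import List, Tuple


def boundary_pause_rate(pause_offsets: List[int],
                        chunk_spans: List[Tuple[int, int]]) -> float:
    """Fraction of pauses located exactly at the start or end of a chunk."""
    if not pause_offsets:
        return 0.0
    boundaries = {start for start, _ in chunk_spans}
    boundaries |= {end for _, end in chunk_spans}
    hits = sum(1 for offset in pause_offsets if offset in boundaries)
    return hits / len(pause_offsets)


# "the cat sat on the mat" chunked as [the cat] [sat] [on] [the mat]
chunk_spans = [(0, 7), (8, 11), (12, 14), (15, 22)]
# Pauses after "cat" and after "on" fall on boundaries; the other two do not.
pause_offsets = [7, 9, 14, 17]

rate = boundary_pause_rate(pause_offsets, chunk_spans)
```

A rate well above what randomly placed pauses would produce would support the link between pauses and chunk boundaries. Testing that properly requires a baseline, which this sketch does not compute.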


