Summary Generation for Course Content Related Questions

ENG6420 (2021-22S1)


This system is developed by the VIP Research Group
under the supervision of Dr. Maiga Chang, Professor at the School of Computing and Information Systems, Athabasca University.

About the Project

A course has many learning objects (e.g., reading materials, external webpages, and assignments) that contain a great deal of content and knowledge for students to learn. This research designed and developed a summary generation service based on content that the machine automatically reads and digests behind the scenes during its idle time.

  • The service uses Natural Language Processing techniques to identify the keyphrases in a question entered by a user.
  • The research goal is for the automatic answering service to correctly identify the keyphrases in a question and to summarize the associated content that is relevant to the question, to the user's satisfaction.

About Us

...
Our Mission

The goal is to use Natural Language Processing concepts and techniques to make a computer capable of reading text-based content and extracting important keyphrases from every sentence. The computer applies the same technique to user questions so that it can identify similar content and generate a corresponding summary for the user.

...
Our Supervisor

Dr. Maiga Chang is a Full Professor in the School of Computing and Information Systems at Athabasca University, Canada.

...
Research Goal

The research goal is for the automatic answering service to correctly identify the keyphrases in a question and to summarize the associated content that is relevant to the question, to the user's satisfaction.

Our Team

...
Maria IRIARTE
2022

Maria is a Chartered Civil Engineer specialized in data science and spatial analysis for large public and environmental infrastructures, holding a Master's Degree in Big Data & Visual Analytics and in Occupational Risk Prevention. Maria is currently writing her doctoral thesis in Computer Science at the International University of La Rioja and pursuing a Master of Science in Information Systems at Athabasca University.

...
Supun DE SILVA
2022

Short bio coming soon.

...
Jayed RAFI
2022

Jayed Rafi is an undergraduate student in the Department of Computer Science at the University of Manitoba. His experience includes software engineering in automation and web technology, and his research interests are natural language processing and artificial intelligence.

Videos


Presentation Video

Live demonstrations of the outcome of 12 weeks of work (May 2021 to July 2021) on the preliminary functions of the Coronavirus Question Answering research, built with Python, PHP, JavaScript (AJAX and JSON), and basic Natural Language Processing. It includes three stages:

  1. Stage 1: File Extraction and Verification
  2. Stage 2: Data Processing
  3. Stage 3: Summary Generation


Stage 1: File Extraction and Verification

Stage 1's major features include (but are not limited to):

  1. Extraction and verification of the uploaded CORD-19 dataset in compressed tar and/or gz file format.
  2. Cron jobs for the backend services.
  3. Dashboard that shows backend services' working progress.
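
The extraction and verification step listed above could look roughly like the following. This is a minimal sketch, not the project's actual backend code: the function name `verify_and_extract` and the rule of skipping non-regular files and paths that would escape the destination folder are assumptions made for illustration. It relies only on Python's standard `tarfile` module.

```python
import tarfile
from pathlib import Path


def verify_and_extract(archive_path, dest_dir):
    """Check that an uploaded file is a readable tar archive (optionally
    gz-compressed) and extract its regular files into dest_dir.

    Hypothetical helper: returns the names of the extracted members and
    raises ValueError when the file cannot be read as a tar archive.
    """
    dest_root = Path(dest_dir).resolve()
    if not tarfile.is_tarfile(str(archive_path)):
        raise ValueError(f"{archive_path} is not a readable tar archive")

    extracted = []
    # "r:*" lets tarfile detect the compression (none, gz, ...) on its own.
    with tarfile.open(str(archive_path), "r:*") as tar:
        for member in tar.getmembers():
            target = (dest_root / member.name).resolve()
            # Assumption: ignore directories, links, and any member whose
            # path would land outside the destination folder.
            if not member.isfile() or dest_root not in target.parents:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(member.name)
    return extracted
```

A scheduled job (such as the cron jobs mentioned above) could call a function like this for each newly uploaded archive and record the returned member list for the dashboard.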


Stage 2: Data Processing

Stage 2's major features include (but are not limited to):

  1. Processing the full text of the CORD-19 dataset in JSON format.
  2. Analyzing and summarizing useful Part-of-Speech (PoS) tags in the CORD-19 dataset.
  3. Storing useful PoS tags and relevant n-grams (n is from 1 to 4).
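
As an illustration of the n-gram part of this stage, the sketch below collects every n-gram of a sentence for n from 1 to 4. It is only an approximation under stated assumptions: the regular-expression tokenizer and the function names `tokenize` and `extract_ngrams` are hypothetical, and the PoS-based filtering of "useful" tags is left out because it would need a tagger that the page does not name.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens.

    Assumption: hyphenated terms such as "sars-cov-2" stay as one token.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def extract_ngrams(tokens, max_n=4):
    """Count all n-grams of the token list for n = 1 .. max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = tokenize("Hand sanitizer kills the virus")
ngrams = extract_ngrams(tokens)
print(ngrams["hand sanitizer"])  # count of the bigram "hand sanitizer"
```

Storing the counts per sentence (for example, keyed by document and sentence identifiers) would make them available later for similarity calculations.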


Stage 3: Summary Generation

Stage 3's major features include (but are not limited to):

  1. Web-based user interface for users to ask their questions related to coronavirus.
  2. Cosine similarity calculation based on the extracted useful PoS tags and their corresponding n-grams from the asked question.
  3. Summary generation and storing for the asked questions.
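
For reference, cosine similarity between two frequency vectors a and b is their dot product divided by the product of their lengths. The sketch below computes it for sparse vectors stored as `Counter` objects keyed by n-gram. The function name `cosine_similarity` and the choice to return 0.0 for an empty vector are assumptions for illustration, not details taken from the project.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse frequency vectors (term -> count)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_a * norm_b)


question = Counter({"hand sanitizer": 1, "covid-19": 1})
sentence = Counter({"hand sanitizer": 2, "alcohol": 1})
print(round(cosine_similarity(question, sentence), 3))
```

The score ranges from 0 (no shared terms) to 1 (same direction), so it does not depend on how long the compared texts are.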

Frequently Asked Questions

  • You can ask anything related to COVID-19. This service aims to generate a summary based on the question asked. You can take a look here to get an idea of the kinds of questions. Some sample questions are: Should I use soap and water or hand sanitizer to protect against COVID-19? Can mosquitoes or ticks spread the virus that causes COVID-19?

  • We are strictly against the idea of collecting user-sensitive data. We store an anonymous system-generated user UUID as a cookie that identifies the user. We keep the question and the summary generated in our database and store them in cookies for a better experience.

  • We are using the CORD-19 dataset. CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. We periodically run backend services to process a large amount of this plain-text data (i.e., 236,336 full-text academic articles as of July 19, 2021) with basic Natural Language Processing techniques that include tokenization, n-gram extraction, and part-of-speech tagging. The summary is then generated using data mining techniques. The service does not guarantee a specific answer to the question, but you can get a general idea about the question asked. The service improves with time and will provide better results in the future.

  • We are using the CORD-19 dataset. CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community. We process the documents by extracting the sentences and then extracting the n-grams in each sentence. When the user asks a question, we identify the keywords (n-grams) in the question and create a vector representation in which each entry is the frequency of a keyword. We compare this question vector to the document vectors by cosine similarity to identify the best document. In the same way, we compare the question vector to the sentences of that document by cosine similarity to find the best-matching sentences. We concatenate the top two sentences to generate the summary.

  • Yes, the service aims to generate a summary based on your question. We have extracted data from tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. It does not guarantee a specific answer to the question, but you can get a general idea about the question asked.

  • No, we are against storing users' personal information. No questions will be shared or made publicly available. We keep the question and the summary in our database for a better user experience. We uniquely identify you by a system-generated UUID and store it in a cookie.

  • We periodically run backend services to process a large amount of this plain-text data (i.e., 236,336 full-text academic articles as of July 19, 2021) with basic Natural Language Processing techniques that include tokenization, n-gram extraction, and part-of-speech tagging. We process the documents by extracting the sentences and then extracting the n-grams in each sentence. The research goal is to have an automatic answering service that correctly identifies the keyphrases in a question and summarizes the associated content that is relevant to the question, to the user's satisfaction.
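
The matching procedure described in the answers above (build a question vector, pick the best document by cosine similarity, then pick and concatenate the two best sentences) can be sketched end to end as follows. This is a simplified illustration rather than the service's implementation: it uses single words instead of PoS-filtered n-grams, a naive punctuation-based sentence splitter, and hypothetical function names (`vectorize`, `cosine`, `split_sentences`, `summarize`).

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Frequency vector of lowercase word tokens (assumption: unigrams only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def summarize(question, documents, top_k=2):
    """Pick the best document, then join its top_k best-matching sentences."""
    if not documents:
        return ""
    q = vectorize(question)
    best_doc = max(documents, key=lambda d: cosine(q, vectorize(d)))
    ranked = sorted(split_sentences(best_doc),
                    key=lambda s: cosine(q, vectorize(s)), reverse=True)
    return " ".join(ranked[:top_k])


docs = [
    "Mosquitoes do not spread the virus. Ticks do not spread the virus either. Weather is warm.",
    "Soap and water remove germs from hands. Hand sanitizer with alcohol also works. Parks are open.",
]
print(summarize("Can mosquitoes or ticks spread the virus?", docs))
```

Selecting the document first keeps the two summary sentences on the same topic, at the cost of ignoring good sentences in lower-ranked documents.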