The Tech Platform

Apr 28, 20223 min

Parsing Text File By Span and Memory

We will write a method and the method will read all of the text inside the file then it will count the occurrences of the words. Like a word how many times used in the text.


 
For example, we have a text file like this:

Murat Pikacu
 
Charmander Murat Pikacu

And the result should be;


 
So let’s start to write our project;


 
In order to implement multiple parsers, we need one interface to return pairs.

public interface IFileParser
 
{
 
Task<Dictionary<string, int>> Parse(
 
stringfilePath,
 
CancellationToken cancellationToken = default);
 
}

Our classic Parser;


 
Because StreamReader has the ReadLine method we will use it. Then we count if there is an occurrence.

public class TextFileParser : IFileParser
 
{
 
public async Task<Dictionary<string, int>> Parse(stringfilePath, CancellationTokencancellationToken=default)
 
{
 
var dic = new Dictionary<string, int>();
 
string line;
 
using (var file = new StreamReader(filePath))
 
{
 
while ((line = await file.ReadLineAsync()) != null)
 
{
 
if (cancellationToken.IsCancellationRequested)
 
{
 
break;
 
}
 
var words = line.Split("").Where
 
(x => !string.IsNullOrWhiteSpace(x));
 
foreach (var word in words)
 
{
 
if (dic.ContainsKey(word))
 
{
 
dic[word] = dic[word] +1;
 
}
 
else
 
{
 
dic[word] =1; }
 
}
 
}
 
}
 

 
return dic;
 
}
 
}

And the second one is here. Here I tried to write code to do the same thing with our classic example. Because there is no ReadLine method that implements Memory buffer I wrote something as if it is ReadLine.

public class TextFileMemoryParser : IFileParser
 
{
 
public async Task<Dictionary<string, int>> Parse(
 
string filePath,
 
CancellationToken cancellationToken = default)
 
{
 
var dic = new Dictionary<string, int>();
 
bool goon = true;
 
string line;
 
var chars = new List<char>();
 
using (var file = new StreamReader(filePath))
 
{
 
Memory<char> memory = new Memory<char>(new char[1]);
 

 
while (goon)
 
{
 
await file.ReadAsync(memory, cancellationToken);
 

 
goon = !file.EndOfStream;
 

 
if (file.EndOfStream)
 
{
 
chars.Add(memory.Span.ToString()[0]);
 
}
 

 
if (file.EndOfStream || memory.Span.Contains('\n') ||
 
memory.Span.Contains('\r'))
 
{
 
line = string.Create(chars.Count, chars, (x, y) =>
 
{
 
for (int i = 0; i < x.Length; i++)
 
{
 
x[i] = y[i];
 
}
 
});
 
foreach (var word in line.Split(" ").Where
 
(x => !string.IsNullOrWhiteSpace(x)))
 
{
 
if (dic.ContainsKey(word))
 
{
 
dic[word] = dic[word] + 1;
 
}
 
else { dic[word] = 1; }
 
}
 
chars.Clear();
 
}
 
else
 
{
 
chars.Add(memory.Span.ToString()[0]);
 
}
 
}
 
}
 
return dic;
 
}
 
}


 
As you can see above we have Memory<char> and we read text file char by char until it is a new line then we create a string by our read chars. Then we do the same thing as we do in our classic example. If we had a split method like Span<string> char[].Span that would be awesome too.

So let's see what their effects are on files.

public async Task OnPostUploadAsync()
 
{
 
if (Upload == null) return;
 

 
_cts = new CancellationTokenSource();
 

 
var file = Path.Combine(
 
_environment.WebRootPath,
 
"uploads",
 
Upload.FileName);
 
using (var fileStream = new FileStream(file, FileMode.Create))
 
{
 
await Upload.CopyToAsync(fileStream, _cts.Token);
 
}
 

 
foreach (var _fileParser in _fileParsers)
 
{
 
Stopwatch sw = Stopwatch.StartNew();
 
WordsWithCount = await _fileParser.Parse(file, _cts.Token);
 
sw.Stop();
 
if (_fileParser is TextFileParser)
 
{
 
DefaultParser = sw.Elapsed.TotalSeconds.ToString();
 
}
 
else if (_fileParser is TextFileMemoryParser)
 
{
 
MemoryParser = sw.Elapsed.TotalSeconds.ToString();
 
}
 
}
 
}


 
After injecting the list of IFileParser we watched them on a higher than 50MB text file and the result is an average of 7 seconds with MemoryParser.


 
Also, if you look at diagnostic tools you can see that CPU usage is less than the classic one.
 
If you try on a file larger than 450MB you can see below that it takes %50 shorter than the classic one. And these results are not got on the released version they get on the debug version.


 
Note: If there is any problem with my codes don’t hesitate to comment about it.

Source: Medium - Murat Can OĞUZHAN

The Tech Platform

www.thetechplatform.com

    0