Efficiently Reading Large Files in Java Without Memory Overload
Chapter 1: Introduction to Processing Large Files
Have you ever faced the challenge of writing a program that handles files larger than the memory available to it? If so, you probably wished you were dealing with files small enough to fit entirely in memory. In that case, the methods of the Files class let us load the whole content at once and process it seamlessly with streams.
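For instance, when a file does fit in memory, a few lines are enough. The snippet below is only an illustration; the file name and the filtering condition are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SmallFileExample {
    public static void main(final String[] args) throws IOException {
        // Loads the entire file into memory at once -- acceptable only for small files
        final List<String> lines = Files.readAllLines(Path.of("server.log"));
        final long putRequests = lines.stream()
                .filter(line -> line.contains(" PUT "))
                .count();
        System.out.println("PUT requests: " + putRequests);
    }
}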
When confronted with large files that exceed memory limits, however, we must adopt a different strategy: reading the files incrementally and employing additional structures to gather the necessary data.
In this article, we will look at a way to analyze large log files in Java when they cannot fit into memory.
Scenario
Our goal is to create a program that examines server log files and produces a report highlighting the top 10 most frequently accessed applications. Each day, a new log file is generated, containing details such as timestamps, host information, duration, service calls, and other irrelevant data.
Example log entries might look like this:
2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT 73.182.150.152 eff0fac5-b997-40a3-87d8-02ff2f397b44
2024-02-25T00:00:00.016+GMT host6 123 logout 2.0.3 GET 34.235.76.94 8b97acae-dd36-4e83-b423-12905a4ab38d
...
Our task is to receive a list of log files and compile a report that includes only those services utilized every day across all provided log files.
Initial Approach
Initially, I considered the business requirements and devised the following code:
public void processFiles(final List<Path> fileList) {
final Map<LocalDate, List<String>> fileContent = getFileContent(fileList);
final List<String> serviceList = getServiceList(fileContent);
final List<Statistics> statisticsList = getStatistics(fileContent, serviceList);
final List<TopCalls> topCalls = getTop10(statisticsList);
print(topCalls);
}
This processFiles method takes a list of files as input and performs several tasks:
- Builds a map from the files, where each key is a LocalDate and the value is the list of lines from the corresponding file.
- Creates a list of unique service names from all files.
- Compiles statistics for every service across the files into a list of Statistics objects.
- Filters this data to identify the top 10 service calls.
- Prints the results.
However, this approach risks overloading the memory, which can lead to an OutOfMemoryError.
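To see why, consider what a helper such as getFileContent would have to do. The article does not show its implementation; the sketch below is hypothetical and assumes the day can be taken from the ISO timestamp at the start of the first line, but any variant shares the same problem: every line of every file stays in memory at the same time.

private Map<LocalDate, List<String>> getFileContent(final List<Path> fileList) {
    final Map<LocalDate, List<String>> fileContent = new HashMap<>();
    for (final Path path : fileList) {
        try {
            // Whole file loaded into memory at once
            final List<String> lines = Files.readAllLines(path);
            // Hypothetical: derive the day from the first 10 characters of the first line
            final LocalDate day = LocalDate.parse(lines.get(0).substring(0, 10));
            fileContent.put(day, lines);
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }
    return fileContent;
}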
Solution
To avoid this issue, we will process files line-by-line and create a map where the service name serves as the key, and a Counter object tracks the number of calls and days each service was accessed.
The revised processFiles method looks like this:
private void processFiles(final List<Path> fileList) {
final Map<String, Counter> compiledMap = new HashMap<>();
for (int i = 0; i < fileList.size(); i++) {
processFile(fileList, compiledMap, i);
}
final List<TopCalls> topCalls =
compiledMap.values().stream()
.filter(Counter::allDaysSet)
.sorted(Comparator.comparing(Counter::getNumberOfCalls).reversed())
.limit(10)
.toList();
print(topCalls);
}
Let's break down this code:
- We declare a Map (compiledMap) where the key is a String representing the service name, and the value is a Counter object that stores relevant statistics.
- We process each file sequentially and update the compiledMap accordingly.
- Finally, we utilize stream features to filter the counters with data for all days, sort them by the number of calls, and retrieve the top 10.
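For completeness, a hypothetical entry point could collect the daily log files and hand them to processFiles. The class name LogAnalyzer and the logs directory are assumptions made only for this example:

public static void main(final String[] args) throws IOException {
    // Hypothetical setup: one log file per day stored in a "logs" directory
    try (Stream<Path> paths = Files.list(Path.of("logs"))) {
        final List<Path> fileList = paths.sorted().toList();
        new LogAnalyzer().processFiles(fileList);
    }
}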
Understanding the Counter Class
Before diving into the processFile method, it’s essential to understand the Counter class, which plays a key role in our solution:
public class Counter {
  @Getter private final String serviceName;
  @Getter private long numberOfCalls;
  private final int numberOfDays;
  private final BitSet daysWithCalls;

  public Counter(final String serviceName, final int numberOfDays) {
    this.serviceName = serviceName;
    this.numberOfCalls = 0L;
    this.numberOfDays = numberOfDays;
    this.daysWithCalls = new BitSet(numberOfDays);
  }

  public void add() {
    numberOfCalls++;
  }

  public void setDay(final int dayNumber) {
    daysWithCalls.set(dayNumber);
  }

  public boolean allDaysSet() {
    // Every day on which the service was called has its bit set,
    // so the service appears in all files exactly when all bits are set.
    return daysWithCalls.cardinality() == numberOfDays;
  }
}
The Counter class holds four attributes: serviceName, numberOfCalls, numberOfDays, and daysWithCalls (the @Getter annotations come from Lombok). The numberOfCalls field is incremented via the add method, which is called once for each processed line. The daysWithCalls field is a BitSet, a memory-efficient structure for storing boolean flags, sized for the number of days (files) being processed.
The setDay method sets the bit for a given day, while the allDaysSet method checks whether the bit is set for every day, i.e., whether the service was called in all of the provided files.
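A quick illustration of how a Counter behaves (the service name and day numbers are arbitrary):

final Counter counter = new Counter("products", 3);
counter.add();     // one call recorded
counter.setDay(0); // seen in the file for day 0
counter.setDay(1); // seen in the file for day 1
// allDaysSet() is still false: day 2 has not been marked yet
counter.setDay(2);
// allDaysSet() now returns true and getNumberOfCalls() returns 1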
Processing Each File
The processFile method handles the core processing:
private void processFile(final List<Path> fileList,
final Map<String, Counter> compiledMap,
final int dayNumber) {
try (Stream<String> lineStream = Files.lines(fileList.get(dayNumber))) {
lineStream
.map(this::toLogLine)
.forEach(
logLine -> {
Counter counter = compiledMap.get(logLine.serviceName());
if (counter == null) {
counter = new Counter(logLine.serviceName(), fileList.size());
compiledMap.put(logLine.serviceName(), counter);
}
counter.add();
counter.setDay(dayNumber);
});
} catch (final IOException e) {
throw new RuntimeException(e);
}
}
This method reads the file line-by-line using the lines method of the Files class, which returns a lazily populated stream and therefore avoids loading the whole file into memory. Each line is transformed into an object by the toLogLine method, which exposes the log line attributes we need (in particular the service name).
For each log line, we retrieve (or create) the appropriate Counter from the compiledMap associated with the service name and then invoke the add and setDay methods of the Counter.
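The toLogLine method is not listed in the article. A minimal sketch, assuming the fields are whitespace-separated as in the sample entries and that only the service name is needed, could look like this:

// Hypothetical record holding only the field we care about
private record LogLine(String serviceName) {}

private LogLine toLogLine(final String line) {
    // Sample entry: 2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT ...
    // The service name is the fourth whitespace-separated field
    final String[] fields = line.split("\\s+");
    return new LogLine(fields[3]);
}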
Conclusion
In summary, handling large files in Java without exhausting memory is quite feasible. The Files class provides the necessary methods for processing files line-by-line, while a hash map helps us store data efficiently during this process.