If a stranger approached you and asked, “Would you carry my bag as your luggage on the flight?”, what would you say? I think that, with a few exceptions, most of us would decline. After all, taking responsibility for something we have no control over is a bad idea.
If you wouldn’t take that risk at the airport, are you sure you are not taking it on the web? We have not yet developed the instinct to spot online threats at a glance. Maybe you let users share files with others, or a file upload is part of your KYC (Know Your Customer) process. It is not obvious how those features can be abused, but leaving an application open to malicious file uploads can expose a user, an application administrator, or even the whole system. Such a file can contain a program that corrupts data or compromises privacy, leading to substantial financial and PR losses.
Protecting against malicious file uploads is a must these days, as most applications store and process files in one way or another. Building an autonomous, automated and accurate pipeline for eliminating those threats is not a simple task, so let’s explore the subject to make informed decisions about file upload scanning.
Goals of scanning
To ensure high resiliency and security of the system, we first need to identify the channels over which files are going to be distributed. The most pressing need is preventing malicious content from being shared, so we look for every way in which users are able to share their data with others (within a team or publicly). An important part of a complete solution is shielding system administrators and support teams from malicious file uploads. We should also try to limit the impact by stopping malicious files from synchronizing across devices, as some of them may be more vulnerable than the device used to upload the file. The last point on the checklist is making sure we scan the files that are used by the system itself.
The most important question is which types of files should be scanned, and whether there is a safe format that can be trusted. The short answer is that there are no safe bets: a file can be disguised as a text document and still contain executable content. Even when it is not, the programs that process those files have their own vulnerabilities. Take JPEG as an example: in 2004, opening a specially crafted image on Windows systems could result in remote code execution.
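To illustrate how a file can be disguised, here is a minimal Python sketch that compares a file's extension with its leading "magic bytes". The signature table is a small, well-known subset, while the extension mapping and the function names (`detect_type`, `looks_disguised`) are assumptions made for this example. Passing this check does not make a file safe, as the JPEG case above shows; it only catches the crudest disguises and is no substitute for an antivirus scan.

```python
from pathlib import PurePosixPath
from typing import Optional

# A few well-known file signatures ("magic bytes"); illustrative, not exhaustive.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",      # also the container for .docx and .xlsx
    "windows_exe": b"MZ",
    "elf": b"\x7fELF",
}

# Hypothetical mapping from extension to the signature we expect to see.
EXPECTED_BY_EXTENSION = {
    ".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
    ".pdf": "pdf", ".zip": "zip", ".docx": "zip",
}

EXECUTABLE_TYPES = {"windows_exe", "elf"}
EXECUTABLE_EXTENSIONS = {".exe", ".dll", ".so"}


def detect_type(head: bytes) -> Optional[str]:
    """Return the name of the first signature matching the file's first bytes."""
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def looks_disguised(filename: str, head: bytes) -> bool:
    """True when the content contradicts what the file name claims."""
    ext = PurePosixPath(filename).suffix.lower()
    detected = detect_type(head)
    expected = EXPECTED_BY_EXTENSION.get(ext)
    if expected is not None and detected != expected:
        return True  # e.g. a ".pdf" that does not start with "%PDF-"
    # Executable header hiding behind a non-executable name such as ".txt".
    return detected in EXECUTABLE_TYPES and ext not in EXECUTABLE_EXTENSIONS
```

For example, `looks_disguised("invoice.pdf", b"MZ\x90\x00")` flags a Windows executable renamed to look like a PDF.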
An upload with malicious content is a ticking bomb. The sooner it is discovered and contained, the better. That is why every storage location holding user uploads needs to be the first stage of automated scanning. As soon as a file lands there, it has to trigger the first scan. Depending on the result, the file is flagged either as safe to work with or as a potential threat.
All suspicious files need to be either isolated or removed from storage, and the owning user notified about the actions taken. As scanning may not yield results immediately, such a notification is likely to take the form of an in-app notification, a push notification or an email. Depending on the business case, there may also be a need for administrator reporting and additional traceability measures, so that customer support can follow up effectively.
If a file remains in storage for a longer time, it needs a periodic scan against known threats. How often those scans should run will vary between applications, but should account for factors such as how often the file is downloaded (impact assessment), how big it is and what kind of data it holds.
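One way to picture periodic rescanning is a small selection step that picks files whose last scan is older than a chosen age and orders them by how often they are downloaded. The sketch below is only an illustration: the `StoredFile` record, its fields and the `due_for_rescan` function are hypothetical, and the right maximum scan age is a decision for each application.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class StoredFile:
    # Hypothetical record kept by the application for every stored upload.
    key: str
    last_scanned_at: datetime
    downloads_last_30_days: int


def due_for_rescan(
    files: Iterable[StoredFile], now: datetime, max_scan_age: timedelta
) -> List[StoredFile]:
    """Return files whose last scan is too old, most-downloaded first."""
    due = [f for f in files if now - f.last_scanned_at >= max_scan_age]
    # Frequently downloaded files have the highest impact, so scan them first.
    return sorted(due, key=lambda f: f.downloads_last_30_days, reverse=True)
```

File size and the kind of data held could be folded in the same way, for instance by choosing a shorter `max_scan_age` for archives or executables.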
Finding the perfect cost/performance balance
To assess the TCO (Total Cost of Ownership) of an automated scanning pipeline, a few factors need to be considered:
- file size – is the file size going to be limited, and what is the usual file size
- scalability – is the number of uploads predictable, is there a need for an automatically scaled solution, are there periods with no uploads at all
- performance – how quickly the user needs to be notified about an issue if a file is flagged as a threat during scanning
- budget – what budget constraints may affect the range of potential solutions (e.g. antivirus software providers)
To be clear, there is no need for accurate numbers; “order of magnitude” estimates are enough to help in the design process. Scanning big files under an unpredictable load, with a need for instant status reports, will cost more than a few scans of small files under a predictable load. The latter can be achieved with a budget as small as a few dollars a month, but with proper planning even scanning big files can be cost-effective. Considering the potential risk reduction and overall platform experience, it is clear that the return on investment for such an activity is significant.
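An order-of-magnitude estimate of this kind can be done in a few lines. The sketch below assumes a pay-per-use scanner billed by memory and execution time; the cost model, the function name and every input value are placeholders for illustration, not real provider prices or measured scan speeds.

```python
def monthly_scan_cost(
    uploads_per_month: int,
    avg_file_mb: float,
    scan_mb_per_second: float,
    overhead_seconds: float,
    memory_gb: float,
    price_per_gb_second: float,
) -> float:
    """Rough monthly cost of scanning every upload once.

    Assumed model: each scan takes a fixed start-up overhead plus time
    proportional to file size, and is billed per GB of memory per second.
    """
    seconds_per_scan = overhead_seconds + avg_file_mb / scan_mb_per_second
    return uploads_per_month * seconds_per_scan * memory_gb * price_per_gb_second


# Placeholder inputs; replace them with your own measurements and price list.
estimate = monthly_scan_cost(
    uploads_per_month=10_000,
    avg_file_mb=5,
    scan_mb_per_second=10,
    overhead_seconds=2,
    memory_gb=2,
    price_per_gb_second=0.00002,
)
print(f"Estimated monthly scanning cost: {estimate:.2f}")
```

Changing one input at a time (ten times more uploads, or files a hundred times larger) shows quickly which factor dominates the bill.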
How does it work in practice?
Learnamp, a learning and employee engagement company, identified a need for additional security measures and protection, as one of its features enables users to share uploaded files with organisation-wide teams. To provide the highest level of security to its users, a solution that scans files uploaded to the AWS S3 service has been put in place.
As soon as a file is uploaded, it is marked as “antivirus scan pending”, which prevents users from downloading it. Within the next few seconds, a scalable fleet of serverless antivirus scanners picks up and processes the file, saving the scan results. If the file is safe to use, the “download” option is unblocked; otherwise the file is removed and a proper notification is issued.
To complement the solution, the database of known viruses is refreshed automatically multiple times a day.
By keeping the system “upload driven”, with serverless scanners that scale automatically and run only when they are needed, the final cost of the solution in times of no upload activity is zero.
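The upload-driven flow described above can be sketched roughly as follows. This is not Learnamp's implementation: it only assumes the standard S3 event notification shape (`Records[].s3.bucket.name` and a URL-encoded `Records[].s3.object.key`), while the status values, the in-memory `statuses` mapping and the injected `is_infected`, `delete_object` and `notify_owner` callables are hypothetical stand-ins for a real scanner, storage client and notification service.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import unquote_plus

ObjectRef = Tuple[str, str]  # (bucket, key)

# Hypothetical status values stored by the application for each upload.
PENDING = "antivirus scan pending"
CLEAN = "clean"
REMOVED = "removed"


def uploaded_objects(event: dict) -> List[ObjectRef]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    refs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        refs.append((bucket, key))
    return refs


def handle_upload(
    event: dict,
    statuses: Dict[ObjectRef, str],
    is_infected: Callable[[str, str], bool],
    delete_object: Callable[[str, str], None],
    notify_owner: Callable[[str, str], None],
) -> None:
    """Scan each uploaded object, then unblock it or remove it."""
    for bucket, key in uploaded_objects(event):
        statuses[(bucket, key)] = PENDING  # downloads stay blocked while scanning
        if is_infected(bucket, key):
            delete_object(bucket, key)
            statuses[(bucket, key)] = REMOVED
            notify_owner(bucket, key)
        else:
            statuses[(bucket, key)] = CLEAN


def can_download(statuses: Dict[ObjectRef, str], bucket: str, key: str) -> bool:
    """Only files with a recorded clean result are downloadable."""
    return statuses.get((bucket, key)) == CLEAN
```

Passing the scanner and the side effects in as callables keeps the flow testable without any cloud access, and denying downloads by default means a file that was never scanned is treated the same as one still pending.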
Do I need it?
A lack of antivirus scanning on files uploaded by users is very often raised by security testers in penetration test findings. This comes as no surprise, as opening an app to file uploads is rarely analysed from a security perspective. There may be many reasons for that: the security aspects may seem intimidating, or the work may seem too big a disruption to day-to-day business. We think neither has to be the case. Happy to argue, though.