The Potential Role of Computational Linguistics by Author Discrimination in the Development of ‘Ulum Al-Hadith


By: Fajri M. Muhammadin


In 2012, Prof Halim Sayoud published an article titled “Author discrimination between the Holy Quran and Prophet’s statements” in Literary and Linguistic Computing, Vol. 27(4). This amazing article used stylometric analysis to perform author discrimination between the language of the Qur’an and that of authentic ahadith.

The results show a big difference between the language style of the Qur’an and that of the authentic ahadith, which indicates that the two have distinct authors. This adds to the already numerous pieces of scientific evidence that Prophet Muhammad ﷺ did not author the Qur’an.

PS: Prof Halim Sayoud has since expanded that research into a much larger study, which he published on his website.

However, research is but a relay. A researcher can only run as far as he/she can until eventually passing on the baton to the next researcher who will continue the race. This is what researchers have always done throughout generations.

Six years later, in 2018, I sent an email to Prof Halim Sayoud, a Professor of Electronics and Informatics at the University of Science and Technology Houari Boumediene, Algeria. I asked him whether it would be possible to do similar research comparing the Qur’an and authentic hadith qudsi, which contain kalam attributed to Allah. (What I really meant was “do you think you can do this research? I will want to read it after you do.” Hehe.)

Alhamdulillah, he replied to my email two days later. He said that such research appears to be impossible: stylometric research requires very large datasets (thousands), while only very few hadith qudsi are available (and authentic ones are even fewer). I was surely disappointed, but what could I do? Nonetheless, I was very happy that he took the time to respond to my email.

Four years later, on exactly 14 May 2022, an epiphany came amidst the screams of my very tired son, who was unable to sleep (we are weaning him). It had nothing to do with the many research projects and other duties that I was supposed to be thinking about at the time.

What if we use stylometric research to compare authentic ahadith and fabricated ones? If anyone fabricated a hadith, surely the language style would differ greatly from that of Prophet Muhammad ﷺ. After all, the ‘ulama of hadith say that one of the characteristics of fabricated hadith is the use of imperfect Arabic language in the matn.

The stumbling block of the previous research idea with hadith qudsi does not exist here. Unfortunately, there are so many fabricated hadith out there. There are even books dedicated specifically to collecting fabricated hadith, such as Kitab Al-Mawdu’at Al-Kubra by Imam Ibn al-Jawzi. There will be an abundance of samples for the dataset.

Successful research would provide clear and scientifically sound indicators that authentic and fabricated ahadith differ markedly in language style. In such a case, there may be some prospects to further develop and utilize this research, inter alia:

  1. It may add to the science of matn criticism, which might later be further developed to examine da’if (but not fabricated) hadith.
  2. One may also compare the ahadith considered sahih by ahlus sunnah with those considered sahih by the shi‘a.
  3. Etc.

It must be noted that even successful research results will not justify using this method as the sole determiner of hadith status. No research like this will ever have a 100% confidence rate. Methodological choices can definitely reduce error factors, but never eliminate them.

After all, even the established methodologies applied by the muhaddithin throughout the ages would hardly achieve a 100% certainty rate, except for the text of the Qur’an and mutawatir hadith (which are not very numerous). What they can do is to further reduce the error factors by analyzing as many aspects and angles as possible while continuously improving the methodology used. Perhaps computational linguistics could contribute to this effort.

Nonetheless, as far as I can think of, there are a few possible problems:

First, the ‘ulama have differed on many occasions regarding the status of a hadith. For the authentic hadith, Prof Halim Sayoud used samples from Kitab Sahih Al-Jami‘, famously known as Sahih Al-Bukhari. There is little to no controversy regarding the authenticity of its content. However, more problems arise with the fabricated hadith. The ‘ulama sometimes differ on whether a narration is fabricated or ‘only’ very weak, for example. Even Ibn al-Jawzi’s Kitab Al-Mawdu’at al-Kubra is not free from criticism. These different rulings regarding hadith status would have different consequences. Hence, such research would need to be careful in setting the criteria used to identify fabricated hadith for its dataset.

Second, some narrations are ruled as fabricated for reasons other than fabrication of the matn. It is possible that a fabricator did on occasion narrate authentic hadith, but all of his reports are rejected because he made fabrications on other occasions. Another possibility is that the matn is not fabricated but the sanad is. Perhaps things like this are why we need very large datasets as samples.

Third, the fabricated ahadith do not come from a single author. There are so many hadith fabricators. In Prof Halim Sayoud’s previous research, he compared two datasets, each of which has one author. In the research I am thinking about, the sahih ahadith surely have one author, i.e. Prophet Muhammad ﷺ. Meanwhile, the dataset of fabricated ahadith would have many authors, each with a style different from the others. It would be wrong to analyze samples from numerous authors, derive a single style characteristic, and treat it as if it belonged to one author. Is there any way to work around this problem?
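One way this question is sometimes framed in stylometry is authorship verification: instead of modelling the fabricators as one "author", only the reference author is modelled, and a questioned document is checked for consistency with that single profile. The sketch below is a toy illustration of that framing under loud assumptions. It is not the method of Prof Halim Sayoud's paper. The names `profile`, `centroid`, `is_consistent`, the `margin` parameter, and the placeholder "documents" made of the tokens `a` and `b` are all hypothetical, and real work would use Arabic text and far richer features.

```python
from collections import Counter
from math import sqrt
from statistics import mean

def profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    words = text.split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocabulary]

def distance(p, q):
    """Euclidean distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(profiles):
    """Component-wise mean of several profiles."""
    return [mean(column) for column in zip(*profiles)]

def is_consistent(questioned, reference_docs, vocabulary, margin=1.5):
    """True if the questioned document lies within the spread of the
    reference documents around their own centroid (scaled by margin).
    Only the reference author is modelled; no 'fabricator' profile is needed."""
    profiles = [profile(doc, vocabulary) for doc in reference_docs]
    center = centroid(profiles)
    threshold = margin * max(distance(p, center) for p in profiles)
    return distance(profile(questioned, vocabulary), center) <= threshold

# Placeholder data only: tokens "a" and "b" stand in for real features.
vocabulary = ["a", "b"]
reference_docs = ["a a a b " * 50, "a a a a b " * 50, "a a b " * 50]
similar_doc = "a a a b a a b " * 50    # style close to the reference documents
different_doc = "a b b b " * 50        # clearly different style

print(is_consistent(similar_doc, reference_docs, vocabulary))
print(is_consistent(different_doc, reference_docs, vocabulary))
```

In this framing the many fabricators never need to share a style. The only question asked is whether a document falls inside or outside the range of the reference author, and the choice of threshold remains a methodological decision.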

Before writing this, I sent an email to Prof Halim Sayoud to ask what he thinks of this idea. I will update this post when (if) he replies, insha’Allah. UPDATE: He has replied to my email. Below is what he said, and afterwards I will share my thoughts:


Assalam Alaikom Fajri

Sorry for the delay…

Firstly, Thank you for the proposal, it’s very interesting indeed.

Everything one can do for the guidance of humanity is good.

As you know, previous stylometric analysis of the holy Quran and Hadith showed that the two books come from two Authors, and then the Quran cannot be an invention of the Prophet PBUH. As a scientific discovery, this may confirm the authenticity of the holy book somehow.

Now, concerning your proposed idea, I think it could be applied in specific conditions.

So, if you try to use stylometry to check the authenticity of a Hadith, it should be very difficult – Let’s take an example, suppose one want to analyze the following Hadith “صوموا تصحُّوا”, which is composed of only 2 words. In this case this 2-words Hadith doesn’t have enough information to compare to the reference database of the authentic Hadith. In fact, a fair stylometric analysis require about 2500 words.

On the other hand if you try to check the authenticity of a consistent dataset of Hadiths, it should be possible. That is, if you have, for instance, 100 Hadiths that you merge together to produce a big text of about say 1000 words, there is a fair possibility to check whether the investigated dataset is genuine or false.

In that context, I tried to conduct such a test with datasets of about 1030 words per document. Results were interesting, since the fabricated documents were automatically identified as false (i.e. not belonging to the Hadith Author).

That is, I hope you appreciated this discussion and I wish you much success in your professional life…

Best regards



Before anything else, I am very thankful that he did not only answer my question hypothetically but also took the time to run some tests.

The highlight is that the stylometry test can only be done with relatively big datasets, i.e. around 2500 words (I’m sure more words would produce stronger results). As he indicated in the email, merging multiple hadith to reach the necessary number of words is possible.
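The merging step can be sketched in a few lines of Python. This is only an illustration, assuming whitespace-separated words and a greedy strategy. The function name `merge_into_documents` and the toy texts are hypothetical, and the default of 2500 words simply reuses the figure mentioned in the email (his own test used documents of about 1030 words).

```python
from typing import Iterable, List

def merge_into_documents(texts: Iterable[str], min_words: int = 2500) -> List[str]:
    """Greedily concatenate short texts until each merged document
    contains at least min_words whitespace-separated words."""
    documents = []
    current = []
    for text in texts:
        current.extend(text.split())
        if len(current) >= min_words:
            documents.append(" ".join(current))
            current = []
    # Any leftover words that do not reach min_words are dropped,
    # since a too-short document would not be analyzed fairly.
    return documents

# Placeholder data: ten "texts" of three words each, merged into
# documents of at least nine words.
toy_texts = ["alpha beta gamma"] * 10
docs = merge_into_documents(toy_texts, min_words=9)
print(len(docs), [len(d.split()) for d in docs])
```

Which texts end up merged together matters as much as the word count: each merged document should only combine narrations of the same status, otherwise the mixed-document problem appears.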

I then asked him what would happen if a dataset containing a mix of authentic and fabricated ahadith (50:50) were compared with a second dataset of 100% authentic ahadith. His response is as follows:


Dear Fajri

In this case you will have some features similar to the reference Hadith and some different, obviously.

Sincerely, I don’t know really what could be the result, but it will probably lead to a borderline decision: not authentic and not too different.

For instance, suppose that a book is written by 2 authors X and Z and is segmented into several segments.

As you can see in the following drawing, if a text segment contains both text from X and Z, then it may be classified as a group Y, which is between them:

XX X XXX XX                        Y                     ZZ ZZZ ZZZ

That is, the document Y is classified quite far from X and from Z too.

Such conditions are misleading, but could bring some information.

Hope I responded to your question…



This is interesting. It is possible that the results of these tests are not merely black and white, YES vs NO, but may also show gray areas where there are some similarities. This should be one among many things considered by the researcher, especially because some hadith forgers (as explained earlier) take authentic ahadith and make up an isnad. A book may also contain a mix of ahadith with various grades of authenticity (or lack thereof).
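The X, Y, Z drawing from the email can be reproduced with a toy calculation. The sketch below assumes a very crude style profile (relative word frequencies) and placeholder "documents" made of the tokens `a` and `b`. The names `profile` and `distance` are hypothetical, and nothing here is taken from Prof Halim Sayoud's actual feature set.

```python
from collections import Counter
from math import sqrt

def profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    words = text.split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocabulary]

def distance(p, q):
    """Euclidean distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Placeholder data only: author X favours "a", author Z favours "b".
vocabulary = ["a", "b"]
doc_x = "a a a b " * 50
doc_z = "a b b b " * 50
# Document Y is a 50:50 mix of text from X and text from Z.
doc_y = "a a a b " * 25 + "a b b b " * 25

p_x = profile(doc_x, vocabulary)
p_y = profile(doc_y, vocabulary)
p_z = profile(doc_z, vocabulary)

d_xy = distance(p_x, p_y)
d_zy = distance(p_z, p_y)
d_xz = distance(p_x, p_z)
print(d_xy, d_zy, d_xz)
```

In this toy case the mixed document sits midway between the two authors: it is closer to each of them than they are to each other, yet it matches neither. That is the borderline situation described in the email.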

So what seems to be the most apparent potential use of this method, considering (a) the need for big datasets and (b) the third problem above concerning multiple fabricators?

I can think of at least three, which further clarify the two possible benefits I mentioned earlier (I am not deleting the previous list, because it is still relevant):

  1. Investigating Hadith Compilations and/or Their Gradings: it is true that many hadith compilations (e.g. Sunan Tirmidhi and Sunan Abi Dawud) were not intended to be fully authentic compilations. However, scholars (e.g. Shaykh Nassirudin Al-Albani) have performed takhrij on the ahadith in the sunan and other compilations and graded them. It is possible to extract all ahadith from the sunan that were authenticated by Shaykh Al-Albani (or Shaykh Shu’ayb al-Arnawt, or whoever) and compare them with Sahih Al-Bukhari using a stylometry test to see what the results look like. We can do the same with Shi’ah books of ahadith, such as Al-Kulayni’s Al-Kafi (of course, this will require sorting out only the narrations attributed to Prophet Muhammad ﷺ and excluding those attributed to the alleged Infallible Imams).
  2. Investigating Certain Hadith Narrators: It might be fruitful to begin by assessing narrations transmitted by known fabricators, one fabricator at a time. One must choose individual fabricators who have narrated many (fabricated) narrations, such as Jabir ibn Yazid Al-Ju’fi and Abu al-Mufaddal al-Shaybani, both Rafidis who fabricated so many narrations (thanks to Ustaz Tommi Marsetio and Ustaz Abdullah Al-Rabbat for telling me this). A stylometry test is then done to compare, for example, Jabir’s fabrications with authentic ahadith. If the results are successful, maybe we can use this method to assess ambiguous (da’if) narrators as an additional consideration for the purpose of jarh wa ta’dil. The limitation of this approach is that it can only test narrators with many narrations.
  3. Investigating Sahabah Accuracy: we can compare authentic ahadith narrated by a single sahabi with other statements authentically attributed to that same sahabi. This would assist in examining the extent to which the sahabah might have paraphrased when narrating ahadith. Bear in mind, of course, that some ahadith are reported by multiple sahabah in the exact same wording, but there are also ahadith reported by different sahabah with some differences in wording.


Unfortunately, I am unable to do this research myself as it is not my field. Insha’Allah I am doing a lot of research in my own field, which intersects with the Islamic sciences, but not this one. I pray that Prof Halim will be inspired to carry out this research, or at least to give constructive feedback on this idea.

I really hope that other Muslim experts on computational linguistics and hadith experts would take this idea and execute it.